Foreword



In 2002, the International Conference on Computer Aided Design (ICCAD) 
celebrates its 20th anniversary. This book commemorates contributions made by 
ICCAD to the broad field of design automation during that time. The foundation 
of ICCAD in 1982 coincided with the growth of Large Scale Integration. The 
sharply increased functionality of board-level circuits led to a major demand 
for more powerful Electronic Design Automation (EDA) tools. At the same 
time, LSI grew quickly and advanced circuit integration became widely avail- 
able. This, in turn, required new tools, using sophisticated modeling, analysis 
and optimization algorithms in order to manage the ever more complex design 
processes. Not surprisingly, during the same period, a number of start-up com- 
panies began to commercialize EDA solutions, complementing various existing 
in-house efforts. The overall increased interest in Design Automation (DA) re- 
quired a new forum for the emerging community of EDA professionals; one 
which would be focused on the publication of high-quality research results and 
provide a structure for the exchange of ideas on a broad scale. 

Many of the original ICCAD volunteers were also members of CANDE 
(Computer-Aided Network Design), a workshop of the IEEE Circuits and Systems 
Society. In fact, it was at a CANDE workshop that Bill McCalla suggested 
the creation of a conference for the EDA professional. (Bill later developed the 
name.) Throughout the years, CANDE has provided an active meeting place for 
its members, disclosing and discussing important new developments, but lack- 
ing a wide forum for formal paper publications and general participation. To 
address this need in turn, two conferences were founded: ICCAD and the 
International Conference on Computer Design (ICCD). ICCAD was largely focused 
on the algorithmic core of CAD, whereas ICCD was mainly oriented to design 
and CAD applications. 

From its inception, ICCAD was structured as a high-quality conference with 
a superb review process, thereby emphasizing original contributions with sig- 
nificant theoretical and practical impact. The technical program and executive 
committees were based on a rotating involvement of key experts from differ- 
ent CAD areas from Asia, Europe, and the American Continent and provided 
the backbone for the international and multi-disciplinary character of ICCAD. 
From the beginning, the conference home was established in the heart of Silicon 
Valley, which was one of the main centers of many emerging CAD activities 
and entrepreneurs and close to key commercial and university research groups. 
In fact, an early guiding light for the Conference was Professor Don Pederson 
from UC Berkeley. ICCAD was initially co-sponsored by the IEEE Circuits and 
Systems Society and the IEEE Computer Society. In 1992, the ACM Special 
Interest Group on Design Automation joined as a co-sponsor. 

After its foundation, ICCAD quickly became a core conference in the CAD 
area and a prime choice for high-quality paper publications. The resulting large 
number of manuscript submissions combined with a stringent selection process 
resulted in a high standard for its papers. Furthermore, ICCAD’s predictable 
presence in Silicon Valley in November and its primary focus on scientific ex- 
change made ICCAD a key event for EDA novices and experts alike. It became a 
place where students, colleagues and friends from industry and academia would 
meet, discuss their work and learn from each other. Soon, regular peripheral 
meetings and events were organized around the ICCAD schedule, reflecting the 
importance of the conference to the EDA community. 

Over 2,200 papers have been published at ICCAD during the past 20 years, 
many of them presenting important innovations that have made their way into 
products or resulted in a long trail of rich follow-up research. For example, many 
original publications in automatic layout generation first appeared in ICCAD. 
Similarly, much of the early work on logic synthesis and optimization was pre- 
sented at ICCAD. Numerous other landmark papers in physical design, syn- 
thesis, circuit and system design and analysis, functional verification, statistical 
modeling and optimization as well as digital testing, timing and manufacturing 
analysis have been published in ICCAD, a selection of which are presented in 
this book. 



John J. Golembeski, Chair 1st ICCAD (1983) 

Paul B. Weil, Chair 3rd ICCAD (1985) 

Ian E. Getreu, Chair 4th ICCAD (1986) 

In memory of William J. McCalla, Chair 2nd ICCAD (1984) 



Preface 



About This Book 

The year 2002 marks the 20th anniversary of the International Conference on 
Computer Aided Design. We decided to celebrate this event by re-publishing 
a selection of papers from among the best contributions presented in ICCAD 
based on their impact on research and applications. In addition to papers, a set 
of overview articles were solicited from leading researchers to comment on the 
historical context of the selected papers and to outline their impact on follow- 
up work. Furthermore, nine sponsoring companies, which have been actively 
involved in ICCAD, contributed special articles outlining the impact of ICCAD 
on their businesses. 

The selection process for the papers was initiated with a period of public 
nominations during which the EDA community was invited to suggest landmark 
ICCAD publications to be included in this collection. Following this period, a 
selection committee completed the list of candidate papers, divided them into 
seven topic areas, reviewed the papers in corresponding subcommittees, and 
made the final selection based on a ranking and voting procedure. The com- 
mittee was composed of key program committee members from all of the past 
ICCAD events, and was chaired by an international group of four leading EDA 
researchers. 

During the public nomination phase we received 216 entries from 65 nominators. 
After the selection committee nominated additional papers, the candidate 
pool included a total of 141 distinct papers representing approximately 6.4% of 
all papers published during the past 20 years of ICCAD. From these candidates 
42 papers were selected for this collection. Clearly, this set is only a sample of 
excellent papers published in ICCAD, since many other important contributions 
could not be included due to space limitations. 

The structure of the book reflects the partitioning of the selection process into 
topic areas. The papers of the individual areas are grouped into separate book 
parts which are introduced by corresponding overview articles. A special part 
on “Industry Viewpoints” includes the articles from the sponsoring companies. 
In addition to a subject and author index, a reference index lists the authors of 
papers referenced in the overview and industry articles at the end of the book. 

The work on this project became an insightful reminder of how much progress 
has been made in EDA during the past 20 years and how many excellent con- 
tributions were published in ICCAD. We hope that the reader enjoys brows- 
ing through this historic collection and perceives the book as encouragement to 
crack the next wave of emerging EDA problems. 



The ICCAD 2002 Executive Committee 

Abstract 

Formal hardware verification ranges from proving that two combinational circuits compute the 
same functions to the much more ambitious task of proving that a sequential circuit obeys some 
abstract property expressed in temporal logic. In tracing the history of work in this area, we 
find a few efforts in the 1970s and 1980s, with a big increase in verification capabilities from the late 
1980s up through today. The advent of efficient Boolean inference methods, starting with Binary 
Decision Diagrams (BDDs) and more recently with efficient Boolean satisfiability (SAT) checkers 
has provided the enabling technology for these advances. 



1. Introduction 

Functional hardware verification involves determining whether or not a logic 
design matches a specification of its intended behavior. Most commonly, the 
design consists of a combinational or sequential logic gate circuit (possibly de- 
rived from an RTL description), and so the analysis can be performed purely 
at the Boolean level. Furthermore, the circuit is generally assumed to be either 
fully combinational or fully synchronous, and hence the functionality can be 
verified without any consideration of the circuit timing. 

In many applications, the specification is also given as a logic circuit. This 
form of verification is referred to as equivalence checking. For example, a de- 
signer might want to verify that some optimizations to a netlist did not alter its 
functionality. Even when verifying that a gate-level netlist implements a speci- 
fication given in a hardware description language such as Verilog or VHDL, the 
first step is typically to expand the HDL description into gate-level form and 
then use this as the specification. 

Equivalence checking can be further categorized as combinational or sequen- 
tial. With combinational equivalence checking, the two circuits are acyclic, 
gate-level circuits, and the task is to determine whether they compute the same 
Boolean functions. Note that combinational equivalence can be used to prove 
the equivalence of two sequential circuits, as long as these circuits use the same 
encodings of their states. In fact, the commercial equivalence checkers now in 
widespread use follow this approach. With sequential equivalence, we are given 
two sequential circuits that could be using totally different state encodings. The 
task is to determine whether the two circuits would ever differ in their output 
values when they are started in their respective initial states and run with some 
arbitrary input sequence. 

Historically, and even to this day, the most common approach to functional 
verification is to perform extensive simulation. For equivalence checking, this 
simply involves simulating the two circuits over many patterns and seeing 
whether they ever produce different values. In principle, combinational circuits 
could be fully verified by this means, if we were willing to run enough simu- 
lation (typically exponential in the number of primary inputs). For sequential 
equivalence checking, there is no practical bound on the amount of simulation 
required to prove the two circuits will have identical behavior for all possible in- 
put sequences. In this commentary, we will focus instead on formal verification, 
where mathematically based techniques are used to prove equivalence or other 
properties for all possible input sequences. 
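
As a point of reference, the simulation-based check described above can be sketched in a few lines of Python. This is only an illustration: the function name `equivalent_by_simulation` and the three multiplexer circuits are hypothetical, and the loop visits all 2^n input patterns, which is exactly the exponential cost that formal methods try to avoid.

```python
from itertools import product

def equivalent_by_simulation(f, g, num_inputs):
    """Simulate both circuits on every input pattern (2**num_inputs of them).

    Returns (True, None) if the outputs always agree, otherwise
    (False, pattern) for the first distinguishing pattern found.
    """
    for pattern in product((False, True), repeat=num_inputs):
        if f(*pattern) != g(*pattern):
            return False, pattern
    return True, None

# Hypothetical circuits: two correct 2-to-1 multiplexers and a faulty one.
def mux_a(s, a, b):
    return (s and a) or ((not s) and b)

def mux_b(s, a, b):
    return b ^ (s and (a ^ b))

def mux_bad(s, a, b):
    return (s and a) or b

print(equivalent_by_simulation(mux_a, mux_b, 3))
print(equivalent_by_simulation(mux_a, mux_bad, 3))
```

The second call reports the pattern s=1, a=0, b=1, on which the faulty multiplexer lets b leak through to the output.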

Going beyond equivalence checking, a more ambitious task is to prove that a 
circuit satisfies some general requirement, such as that there should never be a 
deadlock, or that any bus request will eventually be granted. The most widely 
studied class of tools for this form of verification are model checkers, where 
the program determines (checks) whether the circuit (model) obeys a property 
given as a formula in some type of temporal logic. Such formulas can express 
properties that involve the behavior of the system over time, cases where we use 
English words such as “always” and “eventually.” 
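
To make the temporal operators concrete, the following sketch evaluates two of them on an explicitly enumerated state graph. It is an assumption-laden toy: the names `ef` and `ag` and the four-state example are invented for illustration, only the CTL operators EF ("some path eventually reaches") and AG ("on all paths, always") are covered, and real model checkers handle a much richer logic.

```python
def ef(transitions, target):
    """States from which some path reaches a state in `target` (CTL EF).

    Computed as a backward fixed point: keep adding predecessors of the
    current set until nothing changes.
    """
    reach = set(target)
    changed = True
    while changed:
        changed = False
        for src, dst in transitions:
            if dst in reach and src not in reach:
                reach.add(src)
                changed = True
    return reach

def ag(states, transitions, good):
    """States where `good` holds on every reachable state (CTL AG).

    Uses the duality AG p = not EF (not p).
    """
    bad = set(states) - set(good)
    return set(states) - ef(transitions, bad)

# Hypothetical system: states 0 -> 1 -> 2 -> 0 form a cycle, state 3 is a
# deadlock that can only loop to itself.
states = {0, 1, 2, 3}
transitions = [(0, 1), (1, 2), (2, 0), (3, 3)]

print(ef(transitions, {2}))                # states that can reach state 2
print(ag(states, transitions, {0, 1, 2}))  # states that never hit the deadlock
```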

2. Early Work in Verification 

Formal hardware design verification appears to have developed in the 1970’s 
from earlier work in hardware testing and in software verification [23]. Testing- 
based approaches were applied to equivalence checking. Roth initially pro- 
posed [59] unrolling sequential logic to perform bounded sequential equivalence 
checking. Later [60] he introduced the assumption of a tight correspondence be- 
tween the registers of the circuits to be compared, thus reducing the sequential 
problem to a combinational one. Kawato et al. [40] developed similar equiva- 
lence checking methods around the same time. 

Early formal hardware verification methods based on software methods [29] 
included symbolic simulation [66, 18, 24], user-defined inductive invariants [56] 
and inductive invariants derived from assertions [50]. These early software-based 
methods generally required user interaction to guide expression simplification. 

Automated formal methods continued to be developed [32, 55, 20], but be- 
fore efficient BDD algorithms were introduced most work was based on impov- 
erished data representations (e.g., sum-of-products), or inefficient search rou- 
tines (variants of the DPLL [31, 30] method used for Boolean satisfiability). 
The main problem with these approaches was that they did not exploit common- 
ality of structure. Consequently, either the data representations would blow up, 
or the search routines would run too long. 

A notable exception is work at IBM on equivalence checking. They devel- 
oped programs for internal use that could handle very large combinational cir- 
cuits. 



2.1 Early Equivalence Checking at IBM 

An algorithmic breakthrough was achieved at IBM with the Differential 
Boolean Analyzer using Equivalent Sets of Partials (DBA/ESP) [64]. This algo- 
rithm provided the satisfiability engine for the equivalence checking tool which 
was widely used at IBM in the late 1970’s through the 1980’s. DBA/ESP was 
inspired by Shannon decomposition and also the method of bifurcations given in 
Hammer and Rudeanu’s book [37], the practical potential of which was recognized 
by Al Brown. DBA is also closely related to the Binary Decision Diagrams 
introduced by Akers [2]. 

DBA proceeds by successive elimination of variables using Shannon decom- 
position. The key advances that made DBA practical were effective variable 
ordering heuristics and the ESP feature suggested by Gordon Smith, which de- 
tected shared subproblems so that redundant analysis could be avoided. Variable 
ordering was inspired by the “longest equation” and “most frequent unknown” 
heuristics of Hammer and Rudeanu. The discovery of shared subproblems was 
made feasible by a representation that allowed hashing and efficient structural 
isomorphism checking. Shannon decomposition by itself would generate a bi- 
nary tree, but common subproblem recognition transforms the tree into a di- 
rected acyclic graph, in particular a BDD. Since common subproblem recogni- 
tion is done before the subproblem is analyzed and based on structural isomor- 
phism rather than functional equivalence, the resulting BDD is not fully reduced. 
However, satisfiability of the initial problem is immediately determined once the 
diagram construction is complete or a 1 terminal is reached. 
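
The flavor of this procedure can be conveyed by a small Python sketch. It is not the DBA/ESP implementation: the tuple-based expression format, the function names, and the fixed variable order are all assumptions made for illustration. What it does share with the description above is Shannon decomposition on one variable at a time, recognition of structurally identical subproblems through hashing before they are analyzed, and early termination as soon as a constant 1 is reached.

```python
def cofactor(expr, var, value):
    """Substitute a constant for `var` and simplify constants away."""
    if isinstance(expr, bool):
        return expr
    op = expr[0]
    if op == "var":
        return value if expr[1] == var else expr
    if op == "not":
        a = cofactor(expr[1], var, value)
        return (not a) if isinstance(a, bool) else ("not", a)
    a = cofactor(expr[1], var, value)
    b = cofactor(expr[2], var, value)
    absorbing = (op == "or")      # True absorbs OR, False absorbs AND
    if a is absorbing or b is absorbing:
        return absorbing
    if isinstance(a, bool):       # a is the identity element, so drop it
        return b
    if isinstance(b, bool):
        return a
    return (op, a, b)

def satisfiable(expr, order, memo):
    """Shannon decomposition with sharing of identical subproblems."""
    if isinstance(expr, bool):
        return expr               # reached a 0 or 1 terminal
    if expr in memo:              # structurally identical subproblem
        return memo[expr]
    var, rest = order[0], order[1:]
    # `or` short-circuits, so the search stops once a 1 terminal is found.
    result = (satisfiable(cofactor(expr, var, False), rest, memo)
              or satisfiable(cofactor(expr, var, True), rest, memo))
    memo[expr] = result
    return result

def xor(f, g):
    return ("or", ("and", f, ("not", g)), ("and", ("not", f), g))

# Two circuits are equivalent exactly when their XOR is unsatisfiable.
x, y = ("var", "x"), ("var", "y")
f = ("and", x, y)
g = ("not", ("or", ("not", x), ("not", y)))     # De Morgan form of f
print(satisfiable(xor(f, g), ["x", "y"], {}))   # False: f and g agree
```

Because subproblems are matched by structure rather than by function, two expressions that compute the same function but are written differently are analyzed separately, which mirrors the remark above that the resulting diagram is not fully reduced.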

The IBM equivalence checker also provided means to indicate correspon- 
dence between internal combinational signals, reducing the complexity of the 
problems that the DBA/ESP algorithm needed to solve. The possibility of false 
negatives due to cutting at internal equivalence points was noted, and the use of 
a manually specified don’t care signal to circumvent the problem is described. 
Don’t care signals could also be used to avoid false negatives due to unreachable 
states, with support for validating the unreachable state assertion during 
simulation. 



2.2 Binary Decision Diagrams 

The idea of encoding a Boolean function as a graph of decisions was first 
proposed by Akers [2], based on an earlier encoding as a straight-line program 
by Lee [44]. Akers coined the term “Binary Decision Diagram” (BDD) and also 
explored some of their properties, but he did not provide an efficient method 
for building BDDs from circuits. Akers’ default strategy would be exhaustive 
simulation to build a complete binary tree, followed by reduction operations to 
exploit subtree sharing. 

In late 1983, Bryant was inspired by the way concurrent fault simulators eval- 
uate the gates in a circuit for both the good and many faulty behaviors in a 
single pass through the circuit, using lists to encode the multiple different val- 
ues at each gate. He thought of replacing the list representation with a tree to 
encode all possible input combinations and then realized the subtrees could be 
merged to form a directed acyclic graph. This led him to formulate the Re- 
duced Ordered Binary Decision Diagram (ROBDD, but often simply referred to 
as “BDD”) representation and to devise algorithms for performing operations 
on Boolean functions based on graph algorithms operating on BDDs. He first 
published these ideas in 1985 [12], with a more complete description in the well 
known 1986 paper [13] (submitted for publication in 1984). 

The success of BDDs stems from an interrelated set of issues: 

■ The BDD data structure is based on a maximal sharing of substructure. 
They are not as prone to exponential blow up as are other representations. 

■ Boolean operations can be performed using simple graph algorithms. The 
complexity of these operations is polynomial in the graph sizes. They 
gain efficiency by exploiting the sharing within BDDs. 

■ They provide a single, homogeneous representation of the problem space. 
For example, with symbolic model checking BDDs are used to represent 
the system being modeled, and the sets of possible states of the system. 
By contrast, many EDA programs shift back and forth between many dif- 
ferent data structures. 

■ By providing a general purpose Boolean manipulation engine, BDDs help 
application developers think in more abstract terms. Looking at earlier 
work, we can often see where the application developer muddles concerns 
about the problem with how the problem is represented. 
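
The first two points can be illustrated with a deliberately small ROBDD package. This is a sketch under simplifying assumptions, not a reproduction of any published implementation: the class name `BDD`, its methods, and the integer node identifiers are invented here, the variable order is fixed to the variable indices, and there is no garbage collection. A unique table provides the maximal sharing, and `apply` is the simple recursive graph algorithm with a result cache.

```python
from operator import and_, or_, xor

class BDD:
    def __init__(self, num_vars):
        # Nodes 0 and 1 are the terminals. Giving them variable index
        # num_vars places them below every real variable in the order.
        self.triples = [(num_vars, None, None), (num_vars, None, None)]
        self.unique = {}   # (var, low, high) -> node id, enforces sharing
        self.cache = {}    # (op, u, v) -> node id, avoids repeated work

    def mk(self, var, low, high):
        if low == high:                 # test is redundant, skip the node
            return low
        key = (var, low, high)
        if key not in self.unique:
            self.triples.append(key)
            self.unique[key] = len(self.triples) - 1
        return self.unique[key]

    def var(self, i):
        return self.mk(i, 0, 1)

    def apply(self, op, u, v):
        if u < 2 and v < 2:             # both operands are terminals
            return int(op(bool(u), bool(v)))
        key = (op, u, v)
        if key in self.cache:
            return self.cache[key]
        uvar, ulo, uhi = self.triples[u]
        vvar, vlo, vhi = self.triples[v]
        top = min(uvar, vvar)           # split on the earliest variable
        u0, u1 = (ulo, uhi) if uvar == top else (u, u)
        v0, v1 = (vlo, vhi) if vvar == top else (v, v)
        result = self.mk(top, self.apply(op, u0, v0), self.apply(op, u1, v1))
        self.cache[key] = result
        return result

    def neg(self, u):
        return self.apply(xor, u, 1)

# Canonicity: two different formulas for the same function give one node.
b = BDD(2)
x0, x1 = b.var(0), b.var(1)
f = b.apply(and_, x0, x1)
g = b.neg(b.apply(or_, b.neg(x0), b.neg(x1)))
print(f == g)
```

Since the representation is canonical for a fixed variable order, equivalence checking reduces to comparing two node identifiers.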

2.3 The Effectiveness of BDDs for Design Automation 

Although not among the selected papers for this volume, ICCAD papers by 
Malik et al. [48] and Fujita et al. [35] put BDDs on the map to the larger CAD 
community. They showed 1) that fairly elementary heuristics could select rea- 
sonably good variable orders for combinational circuits, and 2) that BDDs could 
then be constructed for large benchmark circuits enabling tasks to be performed 
(e.g., equivalence checking) that far exceeded what had been done before. 

2.4 Dynamic Variable Ordering 

Rudell’s work [62] strongly reinforces the advantage of having a separate 
Boolean manipulation engine. Previously, others had shown that this engine 
could handle housecleaning tasks such as automatic garbage collection [52] and 
cache management [10]. Rudell showed that it could also handle the task of 
continuously improving the ordering to minimize storage requirements. While 
a program is running, the BDDs pointed to by the application keep changing in 
structure (without changing the underlying function being represented), but the 
user need not be concerned with this. 

The work is also a masterpiece of careful engineering. Rudell recognized that 
the BDD transformations could be made without having to update any of the 
pointers from external sources into the BDD data structure. This allowed many 
BDD-based applications to be “retrofitted” to use dynamic variable ordering 
with only minor changes. In practice, dynamic variable ordering often greatly 
increases the runtime of BDD-based applications, but it can also enable suc- 
cessful completion in cases that would otherwise fail due to excessive memory 
requirements. 

2.5 Combinational Equivalence Checking with BDDs 

Researchers at Bull Research, headed by Francois Anceau, were “early 
adopters” of BDDs. They showed that BDDs could be used to perform equiv- 
alence checking of combinational circuits in an industrial setting in 1987, pub- 
lished in a paper [46] at DAC in 1988. (The earlier work at IBM was not widely 
known, and did not use conventional BDD algorithms.) Their ICCAD paper [47] 
showed that once you have good equivalence checking and a powerful Boolean 
engine, you could go beyond a basic Yes/No equivalence check and attempt to 
determine why the two circuits are not equivalent. It is based on looking for 
small variants of the circuit (e.g., inserting one inverter or changing one Nand to 
Nor) to see if the circuits could be made equivalent. They encode these variants 
symbolically, so that one run of the engine can test all candidate variations. Although 
this particular application of BDDs has not had widespread use, the paper 
illustrates a general principle of using BDDs. By using symbolic encoding, one 
can often replace a long series of tests with a single symbolic evaluation. 

Further developments [41, 38] have continued in diagnostic methods for com- 
binational equivalence checking. 

2.6 Sequential Verification 

Sequential equivalence checking requires proving that two state machines 
have identical behavior. A common approach is to cast this as a reachability 
problem. First, a composite circuit is constructed consisting of the two original 
circuits, plus logic that indicates whether the two circuits have different output 
values. The task then becomes to determine whether, starting in a state where 
the two original circuits are in their initial states, the composite circuit can ever 
reach a state where the comparator circuit indicates that the two circuits have 
generated different outputs. If so, the circuits are inequivalent. This is equiva- 
lent to performing model checking on the composite circuit with the temporal 
logic query EF t (“is it ever possible for t to hold?”), where t is the output of the 
comparator circuit. Thus, we can see that sequential equivalence checking is a 
special case of model checking. 
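
The reachability formulation can be sketched with an explicit-state search over the composite machine. The dictionary-based machine format and the example counters below are hypothetical, and enumerating states one by one is only feasible for tiny designs, which is the limitation that the symbolic methods discussed next were developed to overcome.

```python
from collections import deque

def sequentially_equivalent(m1, m2, inputs):
    """Breadth-first search of the composite (product) machine.

    Each machine is a dict with an initial state `init`, a next-state
    function `next(state, x)` and an output function `out(state, x)`.
    Returns False as soon as a reachable state pair disagrees on an output.
    """
    start = (m1["init"], m2["init"])
    seen = {start}
    queue = deque([start])
    while queue:
        s1, s2 = queue.popleft()
        for x in inputs:
            if m1["out"](s1, x) != m2["out"](s2, x):   # comparator fires
                return False
            nxt = (m1["next"](s1, x), m2["next"](s2, x))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# Hypothetical example: a toggle flip-flop with binary and one-hot encodings.
binary = {"init": 0,
          "next": lambda s, x: s ^ x,
          "out": lambda s, x: s == 1}
one_hot = {"init": (1, 0),
           "next": lambda s, x: (s[1], s[0]) if x else s,
           "out": lambda s, x: s[1] == 1}
faulty = dict(one_hot, out=lambda s, x: s[0] == 1)

print(sequentially_equivalent(binary, one_hot, (0, 1)))
print(sequentially_equivalent(binary, faulty, (0, 1)))
```

The two correct machines use different state encodings, so a combinational comparison of their next-state logic would not apply directly, yet the product-machine search establishes that they are equivalent.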

Model checking was first developed by Clarke and Emerson [21, 22] as a 
way to automatically verify properties of synchronization programs. They also 
coined the term “model checking.” The first implementations of model checkers 
used an explicit state representation, encoding each state as a node in a graph. 
For most hardware designs, the number of states is far too large (exponential 
in the number of state variables) to be represented in such a fashion, and hence 
model checkers were originally only applied to very small circuits. 

The major breakthrough for the application of model checking to hardware 
design came with the advent of symbolic model checking, where both the cir- 
cuit model and the set of reachable states are encoded implicitly, typically with 
BDDs. 

The history of symbolic model checking and symbolic FSM equivalence is 
more difficult to trace, with a number of researchers coming up with similar 
ideas independently. It is generally acknowledged that Ken McMillan originated 
the idea of BDD-based symbolic model checking in 1987 and implemented one. 
But, he did not publish any papers about this work at the time. In 1989, Coudert, 
Berthet, and Madre presented two papers [26, 25] on using symbolic state ma- 
chine traversal to verify the equivalence of two finite state machines. Bose and 
Fisher presented a paper [9] describing a symbolic model checker and its imple- 
mentation. Bahnsen and Kukula [3] also sketched some ideas for BDD-based 
state traversal, but they did not have any implementation. 

A number of advances in model checking were reported in 1990. Burch, 
Clarke, McMillan, and Dill presented their seminal papers [15, 14]. Their work 
was the most general of all, showing that they could symbolically evaluate for- 
mulas in a mu-calculus logic that can express many other forms of logic, includ- 
ing several different temporal logics. Coudert, Madre, and Berthet presented a 
paper [28] on a symbolic, computation-tree-logic (CTL) model checker. In the 
same conference, Pixley [57] presented a sequential equivalence checker, which 
followed a different approach than the reachability approach sketched above. 
Pixley’s method involved determining which pairs of states are equivalent to 
each other, eliminating the need to specify initial states. He also presented de- 
tailed algorithms for symbolic model checking, although his only experimental 
results were for equivalence checking. 

The included paper by Coudert and Madre [27] was the first appearance in 
ICCAD of a paper on symbolic model checking. The subject of the paper was 
some refinements on how to perform the preimage computation more efficiently 
than had been done before. Coudert and Madre’s model checker was based on a function 
vector approach, where the set of reachable states is encoded as the range 
of a set of Boolean functions. This approach has not proved as popular as the 
characteristic function approach, where the set of reachable states is encoded 
as a single Boolean function indicating whether or not a given state is reach- 
able. The Coudert and Madre paper also introduced the operations constrain 
and restrict, which were later explored for other applications [45, 63, 1]. 

Since this early research on symbolic model checking, there has been a con- 
tinuous stream of research on ways to improve efficiency. The most successful 
techniques exploit the modularity of circuits by representing and applying the 
transition relation in a partitioned form [16, 53]. 

3. SAT and ATPG Methods 

Binary Decision Diagrams continue to serve as the foundation for many CAD 
algorithms. They provide a canonical representation, which is compact across a 
wide range of practical Boolean functions and efficiently supports a wide range 
of operations such as intersection, inversion, and quantification. In some situ- 
ations, however, a less powerful approach can be more efficient. For example, 
some applications require finding only a subset of the solutions of a Boolean 
equation. Significant advances and fresh applications in dynamic search ap- 
proaches such as SAT and automatic test pattern generation (ATPG) algorithms 
have paralleled those in BDDs. 

3.1 Efficient Search Space Pruning 

GRASP [49] introduced a new generation of SAT solvers that were designed 
and tuned using EDA benchmarks. GRASP and its successors [54, 36] are based 
on the same DPLL search procedure that has been known for decades, but they 
are much more efficient at pruning the search space to avoid fruitless search. 
In the case of GRASP, the main contribution was “conflict diagnosis,” where 
the solver analyzes the conditions leading to a dead end in the search and in- 
fers from this a general condition that will make the formula unsatisfiable. This 
analysis enables nonchronological backtracking, where the search engine back- 
tracks through multiple levels of decisions. It also enables clause learning, 
where the SAT solver can add information about a failed search (in the form 
of a new clause) to its database and thereby avoid repeating a fruitless search. 
SAT solvers have now supplanted BDDs for many EDA applications where sim- 
ple Boolean operations are required. 
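
For readers unfamiliar with the underlying procedure, here is a minimal sketch of plain DPLL with unit propagation. It deliberately omits everything that distinguishes GRASP and its successors: there is no conflict diagnosis, no nonchronological backtracking, and no clause learning, so it backtracks one decision at a time. The function name and the DIMACS-style clause encoding (a positive integer for a variable, a negative integer for its complement) are choices made for this example.

```python
def dpll(clauses, assignment=None):
    """Return a satisfying (possibly partial) assignment, or None."""
    assignment = dict(assignment or {})
    # Unit propagation: repeat until no unit clause remains.
    while True:
        simplified, unit = [], None
        for clause in clauses:
            if any(assignment.get(abs(l)) == (l > 0) for l in clause):
                continue                      # clause already satisfied
            rest = [l for l in clause if abs(l) not in assignment]
            if not rest:
                return None                   # every literal false: conflict
            if len(rest) == 1 and unit is None:
                unit = rest[0]
            simplified.append(rest)
        clauses = simplified
        if unit is None:
            break
        assignment[abs(unit)] = unit > 0      # forced by the unit clause
    if not clauses:
        return assignment                     # all clauses satisfied
    # Decision: split on the first literal of the first open clause.
    lit = clauses[0][0]
    for value in (lit > 0, not (lit > 0)):
        result = dpll(clauses, {**assignment, abs(lit): value})
        if result is not None:
            return result
    return None                               # chronological backtrack

print(dpll([[1, 2], [-1, 2], [-2, 3]]))
print(dpll([[1, 2], [-1, 2], [1, -2], [-1, -2]]))
```

In this sketch a dead end is simply abandoned. A solver with conflict diagnosis would instead record a new clause summarizing why the dead end occurred, so that the same combination of decisions is never explored again.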

SAT checking has recently flourished as a research area. Each successive 
generation of SAT checker typically outperforms its predecessors by an order 
of magnitude, both in terms of the speed on existing benchmarks and the abil- 
ity to handle larger problems. Much of the recent focus has been on efficiently 
organizing and maintaining the internal data structures, particularly the set of 
clauses. The CHAFF solver [54] demonstrated the value of using clever structures 
to minimize the number of clauses that need to be checked during constraint 
propagation. Other ideas seen in modern SAT checkers include: restarts, 
where the solver abandons the current search tree and starts a new one, as well 
as refined techniques for deciding which elements to discard from the clause set, 
which variable to split on next, and which value (0 or 1) to try first for a splitting 
variable [36]. 

One important application of SAT checkers has been to a limited form of 
model checking, known as bounded model checking [6]. Bounded model checking 
involves running the circuit for a fixed number of steps from its initial state 
and determining whether it satisfies the specified temporal properties. This does 
not guarantee that the properties would then hold indefinitely, but it serves as a 
useful check, often quickly finding counterexample traces when they exist. The 
main advantage of bounded model checking is that it can be applied to very large 
circuits [7, 8]. 
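
The bounded search itself is easy to illustrate, although the sketch below swaps in explicit enumeration of input sequences where a real bounded model checker would encode the unrolled circuit as a single formula and hand it to a SAT solver. The function `bounded_check`, its parameters, and the counter example are hypothetical, and only a simple safety property (a bad state is never reached) is handled.

```python
from itertools import product

def bounded_check(init, step, bad, inputs, k):
    """Look for a trace of at most k steps from `init` to a bad state.

    Returns the list of visited states as a counterexample, or None if no
    violation exists within the bound. None says nothing about longer runs.
    """
    for depth in range(k + 1):
        for seq in product(inputs, repeat=depth):
            state = init
            trace = [state]
            for x in seq:
                state = step(state, x)
                trace.append(state)
            if bad(state):
                return trace
    return None

# Hypothetical circuit: a 3-bit counter that increments when x is 1.
def step(state, x):
    return (state + x) % 8

def bad(state):
    return state == 5

print(bounded_check(0, step, bad, (0, 1), 4))   # bound too small
print(bounded_check(0, step, bad, (0, 1), 5))   # counterexample found
```

The first call illustrates the caveat noted above: the absence of a counterexample within four steps does not establish that the property holds indefinitely.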

3.2 Exploiting Similarity 

Often two combinational circuits being compared for equivalence will have 
many functionally equivalent internal signals. This observation was already ex- 
ploited quite early [33, 4] in the evolution of formal equivalence checkers and 
has since evolved into the most successful approach to improving performance 
and capacity. 

The equivalence checker developed by Berman and Trevillyan [5] detects 
equivalent points within the two circuits and then partitions the circuits, treating 
these equivalence points as primary inputs and outputs of the subcircuits. The 
key idea in this paper is the use of a weighted graph min-cut algorithm to minimize 
the number of subcircuit primary inputs, thus improving efficiency of the 
equivalence check. Treating internal equivalences as primary inputs introduces 
the possibility of false negatives. This paper outlines a method for eliminat- 
ing false negatives by incrementally shifting cutpoints back toward the original 
primary inputs, using BDDs and the compose operation. 

While [5] relies primarily on exhaustive simulation, Kunz [43] used ATPG 
methods to detect and exploit internal equivalences. This more powerful method 
can be used without introducing internal cutpoints, thus avoiding the false neg- 
ative problem. Advances in this area continue to be made by many researchers, 
e.g., [58, 39, 51, 42, 17]. Sequential methods that exploit internal equivalences 
have also been developed [65, 34]. 

3.3 Leveraging Observability 

To further improve the efficiency of combinational equivalence checking, one 
can extend the notion of internal equivalence to include signals that are different 
functions of the primary inputs but whose difference is unobservable at primary 
outputs or state registers. Cerny and Mauras [19] introduced such a method that 
relied on BDD characteristic functions across full cuts of a circuit. 

Brand [11] provided improved efficiency with an ATPG-based combinational 
equivalence checker. This paper introduced the concept of joining two candidate 
equivalent signals with a “Miter” circuit, which indicates whether a difference 
in the signal values is observable at the primary outputs. The complexity of 
the equivalence check is reduced by repeatedly transforming this joined circuit 
based on detected equivalences. Broadening the class of equivalences detected, 
from signals which are identical functions of the primary inputs to signals whose 
functional difference is undetectable at the primary outputs, allows further sim- 
plifications of the problem so that ATPG can be used on larger problems without 
introducing internal cutpoints with their risk of false negatives. 
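
The distinction between functional equivalence and unobservable difference can be shown on a toy example. Everything here is an assumption for illustration: the helper names, the three-input circuit, and the use of exhaustive enumeration in place of an ATPG search. The two internal signals `s1` and `s2` are different functions of the inputs, yet a miter placed at the primary outputs can never be driven to 1.

```python
from itertools import product

def satisfiable(f, num_inputs):
    """True if some input pattern makes f evaluate to True."""
    return any(f(*p) for p in product((False, True), repeat=num_inputs))

def miter(f, g):
    """Comparator that outputs True exactly when f and g differ."""
    return lambda *inputs: f(*inputs) != g(*inputs)

# Hypothetical circuits over inputs (en, x, y). The internal signals differ
# when en is 0 and x equals y, but the AND with en masks that difference.
def s1(en, x, y):
    return x != y

def s2(en, x, y):
    return (x != y) or (not en)

def out1(en, x, y):
    return en and s1(en, x, y)

def out2(en, x, y):
    return en and s2(en, x, y)

print(satisfiable(miter(s1, s2), 3))      # True: signals are not identical
print(satisfiable(miter(out1, out2), 3))  # False: difference is unobservable
```

A checker limited to functionally identical signals could not merge `s1` and `s2`, whereas one that reasons about observability at the outputs can treat them as interchangeable.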

4. Conclusions 

The progress in formal verification from the late 1980s through today has 
been remarkable. The research community has developed and refined both the 
underlying symbolic manipulation engines as well as the verification tools that 
use these engines to reason about complex combinational and sequential cir- 
cuits. In addition, a number of different equivalence and property checkers have 
become available commercially. Formal combinational equivalence checking 
tools have become robust enough to be incorporated routinely into industrial 
methodologies. Still, the needs of the electronics industry in terms of capacity, 
performance, and functionality far exceed the capabilities of current sequential 
verification tools, and even combinational checking occasionally fails due to 
inadequate capacity. We can anticipate that this part of the EDA community will 
continue to flourish in its ideas and importance. 
Abstract 

This paper presents the original extensions brought to Priam to automate both the diagnosis and 
the rectification of the design errors detected by this tool. Priam is an industrial automated for- 
mal verifier used to check the functional correctness of digital circuits of up to 20000 transistors. 
These extensions implement a new approach to diagnosis based on Boolean equation solving. In 
particular, no enumeration of the faulty patterns is necessary to find out the incorrect gates in the 
circuit. The diagnosis system can handle any circuit that can be verified by Priam. 



1. Introduction 

Priam is an automated formal verifier of digital circuits now used by indus- 
trial designers [6]. It checks the functional correctness of digital circuits with 
respect to their specifications. The behavioural specifications as well as the cir- 
cuit descriptions are written in the hardware description language LDS. Priam 
formally proves the equivalence or the implication between LDS programs and 
thus establishes the correctness of the described circuits. 

When Priam detects an error during the verification of a circuit, it provides 
the circuit designer with a description of the error: its kind, the related variable, 
and the equation of the set of input patterns that make the error observable. 
However, until now, no help was given to the designer to diagnose the error, 
that is, to find the reasons for the incorrect behaviour of the circuit. Except 
for very simple errors, this diagnosis task generally takes several hours, to which 
must be added several hours for the rectification. 

This paper presents the diagnosis and rectification system we have developed 
for the debugging of gate level circuit descriptions. This system is based on 
the equation solving facility of the automated propositional theorem prover of 
Priam. The diagnosis and rectification are purely formal methods. The diagnosis 
method overcomes the shortcomings of the methods based on enumeration of 
the faulty patterns. For instance, under the single fault assumption, each gate 
connected to some incorrect output is analyzed once to determine whether it is 
responsible for the incorrect value at this output. 

The paper is divided into 4 parts. Part 2 presents the verifier Priam and its 
underlying propositional prover. Part 3 describes the diagnosis method. Part 4 
explains how the rectification system determines whether the possibly incorrect 
gates of the circuit can be modified to make the circuit function well. Part 5 
gives the fundamental results on theorem proving and Boolean equation solving 
that underlie the diagnosis system. 

2. The Formal Verifier Priam 

Priam is a formal verifier of functional correctness of digital circuits. This 
tool has been integrated in the CAD system used by the circuit designers at Bull. 
Table 1 gives some verification times for industrial circuits. Note that none of 
these circuits could have been verified by simulation because of their very large 
numbers of inputs. 

Within Bull’s design methodology, the behaviour of synchronous circuits is 
described with the hardware description language LDS. Descriptions are at the 
cycle level. They are written in a procedural way like VHDL processes [11]. 
This means that the behavioural specification of a circuit written in LDS de- 
scribes how its output and its transition functions are computed. All the storage 
elements of the circuit must be declared, and there must be no loop without an 
included storage element in the circuit. 



Circuit    #Inputs   #Outputs   #Trans.   Time 
Operabc    160       50         4100      9 min 
Addmul     100       45         5400      15 min 
Tdat       297       192        8400      12 min 
Scd5       663       591        17000     20 min 

Table 1. Verification Times with PRIAM (BULL DPX5000). 

Priam uses symbolic execution of the LDS description of the circuit to compute 
its output and transition functions. Symbolic execution consists in executing 
a program $\mathcal{P}$ with symbolic instead of logical values assigned to its 
inputs [4]. Since the output and transition functions of the circuit can be partial 
functions, Priam manipulates partial Boolean functions that are represented by 
contexted values. A contexted value $(C_f, V_f)$ is a pair of Boolean expressions 
where $C_f$ represents the domain of definition of the function $f$ and $V_f$ represents 
the function $f$ on its domain of definition. 
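
A contexted value can be pictured with a small sketch. It assumes that Boolean expressions are modelled as plain Python predicates rather than as Priam's internal graphs; the class `Contexted` and the example signal `out` are hypothetical.

```python
class Contexted:
    """A partial Boolean function stored as a pair (context, value) of total predicates."""

    def __init__(self, context, value):
        self.context = context  # C_f: True where the function is defined
        self.value = value      # V_f: the function on its domain of definition

    def __call__(self, *bits):
        # Outside the context the function is undefined; None models that here.
        return self.value(*bits) if self.context(*bits) else None

# Hypothetical output that is defined only when the enable input e is True,
# in which case it carries the data input d.
out = Contexted(context=lambda e, d: e, value=lambda e, d: d)

print(out(True, True), out(True, False), out(False, True))
```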

Throughout the symbolic execution of an LDS program $\mathcal{P}$, Priam has to carry out 
proofs in order to establish that the program $\mathcal{P}$ is correct with respect to LDS 
semantics [6]. Priam is built on a powerful propositional prover that makes these 
proofs simple to perform. This prover is based on a new canonical representation 
of propositional formulae called the typed decision graphs (TDG) [2, 6]. 

The formal comparison between two LDS programs $\mathcal{P}_s$ and $\mathcal{P}_r$ is performed 
in two steps. First, Priam symbolically executes the programs $\mathcal{P}_s$ and 
$\mathcal{P}_r$ to compute the contexted values of each output $o$ of the programs. We 
note these contexted values $(C_o(\mathcal{P}_s), V_o(\mathcal{P}_s))$ and $(C_o(\mathcal{P}_r), V_o(\mathcal{P}_r))$ respectively. 
Priam then compares these contexted values according to the selected comparison 
criterion. If $\mathcal{P}_r$ must be proved to be implied by $\mathcal{P}_s$, which is the most 
commonly used comparison criterion between LDS programs, the formulae to be proved 
valid for each output $o$ are the following: 

$C_o(\mathcal{P}_s) \Rightarrow C_o(\mathcal{P}_r)$, and $C_o(\mathcal{P}_s) \Rightarrow (V_o(\mathcal{P}_s) \Leftrightarrow V_o(\mathcal{P}_r))$. 

When Priam proves that these formulae do not hold for some output of the 
programs $\mathcal{P}_s$ and $\mathcal{P}_r$, then the circuit is declared to be functionally incorrect. 
Priam then provides the designer with a description of the error: the associated 
output, the error kind (incorrect context or incorrect value), and the equation of 
the set of all the faulty patterns. However, until now Priam did not provide the 
designer with any help to find the causes of this error. 
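
The implication criterion and the two error kinds can be sketched as follows. This is only an illustration under stated assumptions: contexted values are modelled as pairs of Python predicates, the faulty patterns are enumerated explicitly (Priam instead returns their equation), and the function `faulty_patterns` and the example programs are hypothetical.

```python
from itertools import product

def faulty_patterns(spec, impl, n_inputs):
    """spec and impl are (context, value) pairs for one output.

    Returns the input patterns violating
    C_s => C_r  (incorrect context)  or  C_s => (V_s <=> V_r)  (incorrect value).
    """
    c_s, v_s = spec
    c_r, v_r = impl
    faulty = []
    for bits in product([False, True], repeat=n_inputs):
        if not c_s(*bits):
            continue  # outside the specified domain nothing has to hold
        if not c_r(*bits):
            faulty.append((bits, "incorrect context"))
        elif v_s(*bits) != v_r(*bits):
            faulty.append((bits, "incorrect value"))
    return faulty

# Specification: the output is defined when a is True and then equals b.
spec = (lambda a, b: a, lambda a, b: b)
good = (lambda a, b: True, lambda a, b: a and b)      # agrees with spec wherever a holds
bad_value = (lambda a, b: True, lambda a, b: a or b)  # wrong value when a=True, b=False
bad_context = (lambda a, b: a and b, lambda a, b: b)  # undefined when a=True, b=False

print(faulty_patterns(spec, good, 2))
print(faulty_patterns(spec, bad_value, 2))
print(faulty_patterns(spec, bad_context, 2))
```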

The next section shows that the tedious diagnosis task can be automated. It then 
shows that under some conditions, the diagnosis system can also provide the 
designer with possible rectifications of the circuit. 

3. Automating The Diagnosis of Design Errors 

This section shows that diagnosing design errors is a simpler problem than 
diagnosing faults in real circuits. It then presents the diagnosis method and it 
gives experimental results. 

Diagnosing design errors is quite different from diagnosing faults in real cir- 
cuits. Many different kinds of errors can be introduced during the fabrication 
of a circuit that are difficult to model correctly [7]. On the other hand, when 
dealing with design errors, only one fault needs to be modelled: a circuit does 
not function well because some of its gates are incorrect. This does not mean 
that these gates are incorrectly implemented but rather that they are misused in 
the logical network. For this reason when several outputs of a circuit are de- 
clared incorrect by Priam the errors are considered independent and analysed 
separately. 

Different methods have been proposed to diagnose errors in circuits whose 
structure is known [9, 10]. All are based on enumeration of the faulty patterns, 
which are the input patterns that make the error observable. In the worst case, 
all faulty patterns must be enumerated to determine the set of possibly incorrect 
gates. Moreover, none of these methods directly supports automated rectification. 
The diagnosis method proposed here eliminates this enumeration. Under 
the single fault assumption, a one-pass process over the gates of the circuit pro- 
duces the exact set of gates that can be held responsible for the error detected 
by Priam. This set is empty if and only if the single fault assumption does not 
hold. 



3.1 The Diagnosis Method 

For the sake of clarity this section presents the diagnosis on a purely combinational 
circuit, that is a circuit without any transistors used as switches. We 
consider a combinational circuit $C$ with $n$ inputs noted $i_1, \ldots, i_n$. This circuit has 
several outputs and at least one of these outputs $o$ has an incorrect value. 
This means that the specified Boolean function $f_s$ and the Boolean function $f_r$ 
produced by the circuit at this output $o$ are not equivalent: $f_r \neq f_s$. 

The first step in the diagnosis consists in computing the set $S$ of gates in the 
circuit that are directly or indirectly connected to the output $o$. Indeed only these 
gates can be responsible for the error. Following [7], we call this set $S$ of gates 
the coverage cone of the output $o$. It contains in general a small part of all the 
gates of the circuit. It is computed, like in [7], by an algorithm that traverses the 
circuit $C$ from the output $o$ to the inputs $i_1, \ldots, i_n$. 
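
The backward traversal can be sketched as follows. The netlist encoding (a dictionary `fanin` mapping each gate to the signals that drive it) and the function name `coverage_cone` are assumptions made for illustration, not the representation used in [7] or in Priam.

```python
def coverage_cone(fanin, output):
    """Collect every gate that reaches `output`, walking from the output to the inputs.

    fanin maps each gate name to the list of names driving it;
    primary inputs have no entry and are therefore not part of the cone.
    """
    cone, stack = set(), [output]
    while stack:
        node = stack.pop()
        if node in cone:
            continue              # already visited through another path
        drivers = fanin.get(node)
        if drivers is None:
            continue              # primary input: stop here
        cone.add(node)
        stack.extend(drivers)
    return cone

# Hypothetical netlist with two outputs o and p; g4 only feeds p.
fanin = {
    "o": ["g1", "g2"],
    "g1": ["i1", "i2"],
    "g2": ["i2", "g3"],
    "g3": ["i3"],
    "p": ["g4"],
    "g4": ["i1"],
}
print(sorted(coverage_cone(fanin, "o")))
```

Each gate is visited once, so the cost is linear in the size of the cone.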

The second step in the diagnosis is to determine which gates in the coverage 
cone $S$ are really responsible for the incorrect value of the variable $o$. Under 
the single fault assumption, this problem is the same as determining whether 
the behaviour of any gate $G$ in $S$ can be modified in such a way that the output 
function $f_r$ becomes equivalent to the expected function $f_s$. If we note $[o_1 \cdots o_p]$ 
the outputs of the gate $G$, the diagnosis problem comes down to finding a 
$p$-tuple of Boolean functions $[f'_{o_1} \cdots f'_{o_p}]$ that should be produced by the gate $G$ so 
that $f_r = f_s$. 

In order to find the functions $[f'_{o_1} \cdots f'_{o_p}]$, we create a vector of Boolean 
variables noted $[z_1 \cdots z_p]$, and we propagate them in the circuit $C$ in place of 
the functions $[f_{o_1} \cdots f_{o_p}]$ actually produced by the gate $G$. This propagation, 
which uses Priam’s symbolic execution mechanism, produces the function 
$f_r(i_1, \ldots, i_n, z_1, \ldots, z_p)$ at the output $o$. Thanks to the free variables $z_1, \ldots, z_p$, 
the Boolean function $f_r$ represents all the Boolean functions that can be 
obtained at the output $o$ by modifying the behaviour of the gate $G$. In order for the 
functions $f_s$ and $f_r$ to be equivalent, we must find, for each set of values that can 
be assigned to the inputs of the circuit, the values of the variables $z_1, \ldots, z_p$ such 
that: $f_s(i_1, \ldots, i_n) = f_r(i_1, \ldots, i_n, z_1, \ldots, z_p)$. This is expressed by the following 
theorem. 
Theorem 1 The gate $G$ is possibly responsible for the incorrect value at the 
output $o$ if and only if the following formula is valid: 

$(C)\quad \forall i_1 \cdots i_n \; \exists z_1 \cdots z_p \; \bigl(f_r(i_1, \ldots, i_n, z_1, \ldots, z_p) \Leftrightarrow f_s(i_1, \ldots, i_n)\bigr)$. 

Section 5 explains how this non-trivial theorem can be automatically proved 
by the propositional prover of Priam. It also shows that the proof procedure 
can give us the functional values of the variables $z_1, \ldots, z_p$. These functional 
values represent all the functions that could be produced by the gate in order for 
the circuit to be correct. These functional values are used by the rectification 
system presented in Section 4 to provide the designer with the correct equation 
of the gate $G$. 
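
The test of Theorem 1 can be illustrated on a toy circuit. The sketch below checks formula (C) by brute-force enumeration, whereas Priam proves it symbolically on TDGs; the function `possibly_responsible` and the example circuit are hypothetical.

```python
from itertools import product

B = [False, True]

def possibly_responsible(f_r_free, f_s, n_inputs, n_free):
    """Formula (C): for every input pattern, some values of z make f_r equal to f_s."""
    return all(
        any(f_r_free(*i, *z) == f_s(*i) for z in product(B, repeat=n_free))
        for i in product(B, repeat=n_inputs)
    )

# Specification of the output: o = (a AND b) OR c.
f_s = lambda a, b, c: (a and b) or c

# Faulty circuit: g1 = a OR b (should be AND), g2 = buffer of c, o = g1 OR g2.
# Replacing the output of g1 by a free variable z gives o = z OR c.
f_r_g1 = lambda a, b, c, z: z or c
# Replacing the output of g2 by a free variable z gives o = (a OR b) OR z.
f_r_g2 = lambda a, b, c, z: (a or b) or z

print(possibly_responsible(f_r_g1, f_s, 3, 1))  # g1 can be held responsible
print(possibly_responsible(f_r_g2, f_s, 3, 1))  # g2 cannot
```

For g2 the pattern a=1, b=0, c=0 already rules the gate out: the output is 1 whatever z is, while the specification requires 0.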

3.2 Experimental Results 

The performance of the diagnosis method mainly depends on the time needed 
to propagate the vector $[z_1 \cdots z_p]$ to the output $o$. The variables $z_1, \ldots, z_p$ replace 
possibly complex Boolean functions so their propagation in place of these functions 
can be expected to require less time. This is clearly shown in Table 2. It 
guarantees that any circuit that can be verified with Priam can be handled by 
the diagnosis system. 



Circuit/Output   #Stats   #Cone   CPU Time   #Faults 
Add32/s15        269      53      48 s       3 
Alu32/s14        174      51      192 s      4 
Alu32/s0         174      89      660 s      12 

Table 2. Diagnosis times for 32 bit circuits (BULL DPX5000). 



Table 2 gives the diagnosis times for some industrial circuits. It gives the 
number of statements (“#Stats”) in the LDS program describing the circuit, the 
number of statements in the coverage cone of the incorrect output (“#Cone”), 
the total time (“CPU Time”) needed for the diagnosis, and finally the number of 
possibly incorrect statements (or gates) (“#Faults”). Note that the circuit Add32 
is described at the gate level and that the circuit Alu32 is built out of standard 
cells. 

The diagnosis procedure given above uses the single fault assumption. Expe- 
rience shows that experienced designers make very few faults, so this assump- 
tion holds in many cases. When it does not hold, the diagnosis system will be 
unable to find a gate that can be held responsible for the incorrect behavior of the 
circuit. Nevertheless the diagnosis procedure can be applied to multiple faults, 
except that a tuple of gates of the set S instead of only one gate must be consid- 
ered at each step. We are then faced with the usual combinatorial explosion in 
the search. 
4. Rectifying Design Errors 

Section 3 has presented a procedure to find the gates that can be responsible 
for the incorrect value of some output of a circuit. We now show that this pro- 
cedure gives us enough information to determine which of these gates can be 
rectified in order for the output value to become correct. 

The key idea here is that the formula (C) given in Section 3.1 can be considered 
as a theorem to be proved, but also as an equation to be solved. Solving this 
equation with the procedure given in Section 5 provides us with the functional 
values $[f'_{o_1} \cdots f'_{o_p}]$ of the variables $[z_1 \cdots z_p]$. These functions are the Boolean 
functions that should be produced by the gate $G$ under analysis for the output 
function $f_r$ to be correct. Note that there can exist more than one solution to the 
Boolean equation (C). The solver then introduces a finite number of parameters 
in the functions. This means that $[f'_{o_1} \cdots f'_{o_p}]$ are higher order functions. They 
represent the set $\mathcal{F}$ of all different vectors of output functions of the gate $G$ for 
which the value of the output $o$ is correct. 

The problem we address here is to determine whether it is possible to rectify 
the gate $G$, without changing the global structure of the circuit. If this is possible, 
then the cost of rectifying the design error is minimal because the structure 
of the circuit is not modified. We note $[f_1 \cdots f_m]$ the tuple of Boolean functions 
taken as inputs by the gate $G$. The gate $G$ produces the output functions 
$[f_{o_1} \cdots f_{o_p}]$. This means that $G$ implements $p$ compositions noted $h_1, \ldots, h_p$ 
such that $h_j(f_1, \ldots, f_m) = f_{o_j}$. To rectify the gate $G$ is to find $p$ compositions 
$h'_1, \ldots, h'_p$ that produce correct functions at the gate’s outputs. This is 
expressed by the following theorem. 

Theorem 2 The gate $G$ can be rectified if and only if the following formula is 
valid: 

$(R)\quad \exists h'_1 \cdots h'_p, \; [h'_1(f_1, \ldots, f_m) \cdots h'_p(f_1, \ldots, f_m)] \in \mathcal{F}$. 

The resolution procedure of this equation is given in Section 5.2. It returns, 
if they exist, all the functions $[h'_1 \cdots h'_p]$ that compose the input functions of the 
gate $G$ in such a way that the circuit $C$ produces a correct function at the output 
$o$. When several functions exist, the designer has to choose the best one among 
them according to some criteria. 

The rectification problem is essentially combinatorial. When the gate $G$ has 
$m$ inputs and $p$ outputs, the resolution procedure given in Section 5.2 has to 
deal with $(p \times 2^m)$ Boolean variables. This combinatorial explosion restricts the 
rectification method to circuits whose gates have a small number 
of inputs and outputs. When this is not the case, for instance when complex 
standard cells are used in the circuit, the rectification problem looks very much 
like the synthesis one and the same problems are encountered. 
4.1 Experimental Results 

The CPU time needed to rectify a gate directly depends on its number of 
inputs and the complexity of its input functions. No rectification is attempted 
when the number of inputs is larger than 10. Experience shows that for the 32 
bit circuits whose diagnosis times are given in Table 2, these CPU times are less 
than 5 seconds. None of the 3 possibly incorrect gates of the 32 bit adder Add32 
can be rectified. For the 32 bit ALU Alu32, the system finds that only one gate 
can be rectified in order for the output s14 to become correct, and only one gate 
for the output s0. In fact, these two gates are one and the same. 

5. The Boolean Equation Solver 

We have presented in [6] the propositional theorem prover underlying Priam. 
This prover is based on a new canonical form of the propositional formulae 
that we have called the Shannon typed canonical form. Formulae in this form 
are represented by graphs called Typed Decision Graphs (TDG). In Priam the 
prover is mainly used as a rewriting system to reduce propositional formulae 
to their canonical form. In this section we show that it can also be used to 
prove quantified formulae valid. We then show that proving a quantified formula 
valid is similar to solving a Boolean equation and we give in Section 5.2 several 
resolution procedures. 

5.1 Validity of a Quantified Expression 

We consider here a higher-order logic with a finite domain of interpretation. 
Since the domain of interpretation is finite, any closed formula in this logic 
can be rewritten into a closed formula whose variables are propositional variables [5]. 
For instance, the formula $(\forall x_1 x_2\, \exists x_3\; (x_1 \lor x_2) \Leftrightarrow (x_2 \lor x_3))$ is such a 
quantified formula. A term is a formula without any quantifier. We note $f_{/x=y}$ 
the formula obtained by substituting each occurrence of the variable $x$ with the 
term $y$. 

The validity of any closed quantified formula is inductively defined in the 
following way: 

■ the formula True is valid and the formula False is not valid. 

■ the closed formula $(\forall x\, f)$ is valid if and only if both the formulae $f_{/x=\mathrm{False}}$ 
and $f_{/x=\mathrm{True}}$ are valid. 

■ the closed formula $(\exists x\, f)$ is valid if and only if at least one of the formulae 
$f_{/x=\mathrm{False}}$ and $f_{/x=\mathrm{True}}$ is valid. 

The proof procedure called valid given in Figure 1 is a direct implementation 
of this inductive definition. The formula to be proved valid is $(Q_1 x_1 \cdots Q_n x_n\, t)$, 
type vertex = record 
    index : integer;           { index of the variable labelling this vertex } 
    low, high : tdg;           { cofactors for variable = False / True } 
end; 

type tdg = record 
    tag : char;                { '+' or '-': type of the graph } 
    node : vertex; 
end; 

var True, False : tdg; 

function valid(t : tdg) : tdg; 
var v : vertex; 
begin 
    if (t = True) or (t = False) then return t; 
    v := t.node; 
    if (∀-quantified v.index) then      { universal: both branches must be valid } 
        if (t.tag = '+') then 
            return And(valid(v.low), valid(v.high)) 
        else 
            return And(valid(Not(v.low)), valid(Not(v.high))) 
    else                                { existential: one valid branch suffices } 
        if (t.tag = '+') then 
            return Or(valid(v.low), valid(v.high)) 
        else 
            return Or(valid(Not(v.low)), valid(Not(v.high))); 
end; 



Figure 1. Proof Procedure of a quantified closed formula. 



where $Q_1, \ldots, Q_n$ are quantifiers and $t$ is a term. The function valid takes as input 
the Shannon tree of the term $t$ built with the order $x_1 < \cdots < x_n$. The proof 
procedure traverses this tree and determines, at each step, whether the subterm 
represented by some vertex in the tree is valid. The function returns True if the 
term is valid, else it returns False. 

When all the quantifiers of the formula $(Q_1 x_1 \cdots Q_n x_n\, t)$ are identical, the 
canonicity of the representation makes the proof trivial. For instance, the formula 
$(\forall x_1 \cdots x_n\, t)$ is valid if and only if the term $t$ is a tautology. In the same 
way, the formula $(\exists x_1 \cdots x_n\, t)$ is valid if and only if the term $t$ is not identically 
equal to False. These remarks are used to optimize the proof procedure given 
in Figure 1. In a more general way, there is a proof procedure [3] that does not 
require the TDG of the term $t$ to be built with the order $x_1 < \cdots < x_n$. 
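
The inductive definition of validity can be turned into a short executable sketch. It works on a Python predicate by explicit Shannon expansion instead of on a TDG, so it has none of the sharing or canonicity benefits described above; the encoding of the quantifier prefix as a string of `'A'` and `'E'` characters is an assumption made for illustration.

```python
def valid(prefix, term, env=()):
    """Decide the closed formula (Q1 x1 ... Qn xn  term).

    prefix: string with one character per variable, 'A' (for all) or 'E' (exists),
            in the order x1 < ... < xn.
    term:   predicate over n Booleans.
    env:    values already chosen for x1 ... xk.
    """
    if len(env) == len(prefix):
        return term(*env)                      # no quantifier left: True or False
    branches = (valid(prefix, term, env + (b,)) for b in (False, True))
    if prefix[len(env)] == "A":
        return all(branches)                   # both cofactors must be valid
    return any(branches)                       # at least one cofactor must be valid

differ = lambda x, y: x != y
print(valid("AE", differ))  # for every x there is a different y
print(valid("EA", differ))  # no single x differs from every y
```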

5.2 Solving Equations 

Any closed formula $(Q_1 x_1 \cdots Q_n x_n\, t)$ can be seen as an equation whose 
unknown variables are its existentially quantified variables. From the theoretical 
point of view, solving the equation “$t = \mathrm{True}$” is the same as finding Skolem 
functions of the existentially quantified variables [3]. We give here the resolution 
procedure for an equation that has only one unknown variable. The general 
resolution procedure can be found in [3]. 

Consider the equation “$t(x_1, \ldots, x_n, y) = \mathrm{True}$” with only one unknown variable 
$y$. We note $\mathit{tree}_t$ the Shannon decomposition tree of the term $t$ with the 
order $y < x_1 < \cdots < x_n$. This tree can be written $((\neg y) \land L) \lor (y \land H)$, where $L$ 
and $H$ are both in canonical form. Then two cases must be considered: 

■ the formula $(L \lor H)$ is not a tautology. Then for some interpretation of 
the variables $x_1, \ldots, x_n$, both the formulae $L$ and $H$ evaluate to False, and 
the term $t$ also evaluates to False. It is thus not possible to assign a value 
to $y$ such that $t$ evaluates to True. This means that the equation has no 
solution. 

■ the formula $(L \lor H)$ is a tautology. Then for any interpretation of the 
variables $x_1, \ldots, x_n$, either the formula $L$ or the formula $H$ evaluates to 
True. In any case, it is possible to assign a value to $y$ such that $t$ evaluates 
to True. When both $L$ and $H$ evaluate to True (this can happen if the 
formula $(L \land H)$ is not an antilogy), then any value can be assigned to $y$. 
The solution of the equation is: $y = (\neg L) \lor (H \land v)$, where $v$ is a new free 
variable. 
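
The two cases above can be checked on a small example. The sketch computes the cofactors $L$ and $H$ by evaluating a Python predicate, which is an assumption made for illustration (Priam reads them off the Shannon tree); the name `solve_for_y` is hypothetical.

```python
from itertools import product

B = [False, True]

def solve_for_y(t, n):
    """Solve t(x1, ..., xn, y) = True for the single unknown y.

    Returns None when (L or H) is not a tautology, otherwise the parametric
    solution y(x1, ..., xn, v) = (not L) or (H and v).
    """
    L = lambda *x: t(*x, False)   # cofactor of t for y = False
    H = lambda *x: t(*x, True)    # cofactor of t for y = True
    if not all(L(*x) or H(*x) for x in product(B, repeat=n)):
        return None               # some x makes t False whatever y is
    return lambda *xv: (not L(*xv[:-1])) or (H(*xv[:-1]) and xv[-1])

# Example: t = (x1 XOR y) OR x2. When x2 is True any y works (parameter v is free).
t = lambda x1, x2, y: (x1 != y) or x2
y = solve_for_y(t, 2)
print(all(t(x1, x2, y(x1, x2, v)) for x1 in B for x2 in B for v in B))

# Example without solution: t = x AND y is False whenever x is False.
print(solve_for_y(lambda x, y: x and y, 1))
```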

Solving a Functional Equation 

This section explains how the rectification system solves the functional equation 
(R) presented in Section 4. Consider a set of Boolean functions $f_1, \ldots, f_m$, 
and $f$ whose variables are $i_1, \ldots, i_n$. The problem we address here is to find a 
Boolean function $h$ such that: 

$(F1)\quad h(f_1, \ldots, f_m) = f$. 

There are $2^N$, where $N = 2^m$, Boolean functions with $m$ inputs $v_1, \ldots, v_m$. All 
these functions can be represented by a single TDG $F_m$ built out of the variables 
$v_1, \ldots, v_m$ and $N = 2^m$ new Boolean variables that we note $h_1, \ldots, h_N$. This TDG 
has $2^{m+1} - 1$ vertices. Any interpretation of the variables $h_1, \ldots, h_N$ defines 
one and only one of these Boolean functions. Figure 2.a shows $F_2$. 

Starting from the TDG $F_m$, we can compute the TDG $F_c$ that represents the set 
of Boolean functions that can be obtained by composing the Boolean functions 
$f_1, \ldots, f_m$. This TDG is obtained by substituting the TDGs of the functions 
$f_1, \ldots, f_m$ for the variables $v_1, \ldots, v_m$ in the TDG $F_m$. The variables occurring in 
$F_c$ are the input variables $i_1, \ldots, i_n$, and the variables $h_1, \ldots, h_N$. Finally, finding 
a function $h$ that is a solution of (F1) is the same as finding a tuple $(h_1, \ldots, h_N)$ 
that is a solution of the equation: 

$(F2)\quad \exists h_1 \cdots h_N \; \forall i_1 \cdots i_n \; F_c(h_1, \ldots, h_N, i_1, \ldots, i_n) = f(i_1, \ldots, i_n)$. 
Figure 2. Resolution of a functional equation. 



When this equation has solutions, the resolution procedure returns the TDGs 
representing the values of $(h_1, \ldots, h_N)$. If there is only one solution to the 
equation then all these TDGs are equal to either True or False. However if there are 
several solutions then some parameters (at most $N$) occur in the TDGs. These 
TDGs are then substituted back in the TDG $F_m$ and we obtain the TDG that 
represents all the Boolean functions that are solutions of the functional equation 
(F1). 

Figure 2 shows the application of this resolution procedure to the case where 
$f_1 = ((\neg a) \land b) \lor (a \oplus (\neg c))$, $f_2 = a \lor (\neg c)$, and the function to be obtained 
is $f = ((\neg a) \land b \land c)$. Figure 2.b shows the TDG $F_c$ that represents all the 
functions that can be obtained by composing the functions $f_1$ and $f_2$. Figure 2.c 
shows the TDG $\mathit{TDG}_f$ of the function $f$. In order for the TDGs $F_c$ and $\mathit{TDG}_f$ to 
be isomorphic we must have $h_1 = h_2 = h_4 = \mathrm{False}$ and $h_3 = \mathrm{True}$. Figure 2.d is the 
TDG of the function $h = v_1 \land (\neg v_2)$ that is the only solution of the functional 
equation. 
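
The example of Figure 2 can be replayed by brute force. The sketch below enumerates all $2^N$ candidate truth tables for $h$ rather than solving (F2) on TDGs, so it only illustrates the equation and not the resolution procedure; the function `solutions` is hypothetical, and the parameters $h_1, \ldots, h_N$ are assumed to be numbered in the order $(v_1, v_2) = 00, 01, 10, 11$.

```python
from itertools import product

B = [False, True]

# Functions of the example; a XOR (NOT c) is written a != (not c).
f1 = lambda a, b, c: ((not a) and b) or (a != (not c))
f2 = lambda a, b, c: a or (not c)
f = lambda a, b, c: (not a) and b and c

def solutions(fs, target, n):
    """Return every truth table (h_1, ..., h_N) such that h(f_1, ..., f_m) = target."""
    rows = list(product(B, repeat=len(fs)))       # row k corresponds to h_{k+1}
    found = []
    for table in product(B, repeat=len(rows)):    # one candidate function h
        h = dict(zip(rows, table))
        if all(h[tuple(g(*i) for g in fs)] == target(*i)
               for i in product(B, repeat=n)):
            found.append(table)
    return found

# Single solution: h_3 is True and the others are False, i.e. h = v1 AND (NOT v2).
print(solutions([f1, f2], f, 3))
```

The enumeration covers $2^{2^m}$ tables, which is why the text limits rectification to gates with few inputs.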



6. Conclusion 

This paper has presented the procedures we have developed to automate the 
diagnosis and the rectification of the design errors detected by the formal veri- 
fier Priam. The original diagnosis method is based on algorithms for proving 
quantified propositional formulae and for solving Boolean equations. These 
algorithms operate on formulae in Shannon typed canonical form that are repre- 
sented with Typed Decision Graphs. The diagnosis procedure can be applied on 
any circuit that can be verified by Priam. When the diagnosis system has for- 
mally proved that some gate can be held responsible for the incorrect behaviour 
of the circuit, the rectification system is called. It determines whether the 
circuit can be rectified without changing its structure, and when this is possible, it 
provides the designers with the rectified equations of the gate. 
Abstract 

Determining whether or not two circuits are functionally equivalent is of fundamental importance 
in many phases of the design of computer logic. We describe a new method for circuit equivalence 
which proceeds by reducing the question of whether two circuits are equivalent to a number of 
more easily answered questions concerning the equivalence of smaller, related circuits. This 
method can be used to extend the power of any given equivalence-checking algorithm. We report 
the results of experiments evaluating our technique. 



1. Introduction 

Deciding whether two logic designs are functionally equivalent is a proce- 
dure with many applications in the design of computer logic. Unfortunately, 
the problem of determining functional equivalence of combinational circuits is 
known to be co-NP complete [4]. This suggests that it will be difficult to de- 
velop algorithms which solve this problem efficiently in all cases. In this paper, 
we present a method which reduces the question of the equivalence of two cir- 
cuits to a number of questions about the equivalence of smaller circuits. Since 
the performance of equivalence checkers frequently degrades exponentially with 
the size of the circuits to be checked, this method can extend the power of an 
equivalence checker and can lead to vastly improved run times. We report the 
results of some experiments run to evaluate the method. 

Until now, all equivalence checkers have used flat logic models for the circuits 
they are trying to compare. The possibility of developing “decomposition 
based” checkers, which prove equivalences between internal signals in the circuits 
being compared and use these internal equivalences to establish equivalence 
between the entire circuits, has been suggested [1, 6]; however, no practical 
method for using these internal points has been offered. Our technique supplies 
this missing piece. We automatically discover a decomposition that facilitates 
equivalence checking. We do this by employing the min-cut algorithm [7] in 
a way which permits us to use intermediate equivalence points in the logic to 
break the full equivalence problem into a number of smaller, more manageable 
problems, in much the same way that the use of lemmas allows us to shorten and 
simplify the proof of mathematical theorems. 
Our experiments suggest that our technique has potential; however, there are 
limitations. One limitation is that, since our method depends on establishing in- 
ternal equivalences, it will not work if the two circuits being compared have no 
equivalent internal signals. We have found that in real applications, such as de- 
sign verification, there are generally enough corresponding signals to enable our 
method to achieve significant speedup. We realize that in certain applications 
this may not be the case. 

Another limitation is inherent in any technique that tries to decide equiva- 
lence by decomposition. Since the basic step in a procedure of this type is to 
identify internal equivalences and then to proceed with an equivalence proof 
treating these internal equivalences as independent inputs, we may fail to recognize 
that certain functions are identical. We call this failure to recognize identity 
a false negative. In the final section of this paper, we describe an unimplemented 
method developed with R. E. Bryant, which may address this problem. 

Despite these two limitations, the experiments that we report suggest that this 
technique has significant potential for extending the power of known equiva- 
lence checkers. 

The paper is organized as follows. In Section 2.1, we introduce some required 
terminology. In Sections 2.2 to 2.4 the different parts of the algorithm are described. In 
Section 3, we illustrate the algorithm with an example and report our experi- 
mental results. Section 4 contains an approach to the problem of false negatives, 
and Section 5 is a summary. 

2. Description of the Algorithm 

In this section, we present a method which reduces the question of the equiv- 
alence of two circuits to a number of questions about the equivalence of smaller 
circuits. 

Our method is based on maintaining two data structures, PEF-list and EQ- 
list. During the algorithm, EQ-list always contains a list of pairs of signals 
known to be equivalent. The input to the algorithm includes a list of correspon- 
dences between the inputs to the two circuits being tested; this information is 
used to initialize EQ-list. Eventually, if the circuits are equivalent, the corre- 
sponding output pairs will be placed in EQ-list. PEF-list contains a list of pairs 
of signals which may be equivalent. 

The equivalence checking process is broken into three cooperating processes 
(which are described in detail in the remainder of this section): 



1 BC: A base checker which can check equivalence of signal pairs produced 
by “small” circuits. 



2 PEF: A potential equivalence finder which builds PEF-list. 
3 CDP: A circuit-decomposition process which maintains EQ-list. This 
process uses the current EQ-list to determine whether elements of PEF- 
list can be added to EQ-list. 

Pseudo-code for the algorithm is shown in Figure 1. Our main result is a 
technique which uses the min-cut algorithm [7] to perform step 3. 

2.1 Terminology 

Circuits are represented as directed acyclic graphs (dags). The nodes of the 
dag are of three types: INPUT, OUTPUT, or combinational. Combinational 
nodes are labeled with a description of the function that they perform. We use 
the terms gate, node and box interchangeably to refer to the nodes of our dag, 
and the terms wire or signal to refer to the set of edges beginning at a node. 
Equivalence of synchronous logic (using the same state encoding) can be ac- 
commodated in our model by expanding the memory elements in a suitable 
manner and treating their inputs and outputs as circuit outputs and inputs 
respectively. If $s$ and $t$ are signals, we write $t \to s$ if an edge of $t$ has as a sink the 
node at which the edges of $s$ begin. A path from $t$ to $s$ is a sequence of signals 
$j_1, \ldots, j_r$ such that $t \to j_1 \to \cdots \to j_r \to s$. 

The algorithm we present compares two circuits which we call the “refer- 
ence model” and the “comparison model”. The algorithm begins with a list 
of corresponding signals from the two models. These signals are called initial 
equivalence points. An assignment to the inputs of the two circuits is consistent 
if it assigns the same values to corresponding inputs in the two circuits. A pair 
of signals, (s,s’) is an (actual) equivalence if s and s’ receive the same values for 
all consistent assignments to initial equivalences. In this case, we say that s and 
s’ are functionally equivalent and they are called equivalence points. 

Let $s_1$ and $s_2 : \{0,1\}^n \to \{0,1\}$, $F_1$ and $F_2 : \{0,1\}^m \to \{0,1\}$, and 
$g : \{0,1\}^n \to \{0,1\}^m$ be such that $s_1 = F_1 \circ g$ and $s_2 = F_2 \circ g$. The triple 
$(F_1, F_2, g)$ is called a simultaneous decomposition of $s_1$ and $s_2$, and $m$ is the 
size of the decomposition. 
The following simple result, which we state without proof, is the basis for our 
equivalence reduction technique: 

Theorem: If $(F_1, F_2, g)$ is a simultaneous decomposition of $s_1$ and $s_2$, then 
$(F_1 = F_2) \Rightarrow s_1 = s_2$. 
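
The theorem can be illustrated on a toy case. The sketch below is only an illustration under stated assumptions: the functions `g`, `F1` and `F2` are hypothetical and are not taken from the circuits discussed in this paper. It checks the reduced problem over the m = 2 cut signals and then confirms the conclusion over the n = 4 original inputs.

```python
from itertools import product

# Hypothetical shared logic g: maps n = 4 inputs to m = 2 intermediate signals.
def g(a, b, c, d):
    return (a & b, c ^ d)

# Two hypothetical "top" functions over the m = 2 intermediate signals.
def F1(u, v):
    return u | v

def F2(u, v):
    # De Morgan form of F1.
    return 1 - ((1 - u) & (1 - v))

# s1 = F1 o g and s2 = F2 o g, so (F1, F2, g) is a simultaneous decomposition.
def s1(*x):
    return F1(*g(*x))

def s2(*x):
    return F2(*g(*x))

def equal_on_all_inputs(p, q, arity):
    """Exhaustively compare two Boolean functions of the given arity."""
    return all(p(*x) == q(*x) for x in product((0, 1), repeat=arity))

reduced_check = equal_on_all_inputs(F1, F2, 2)  # 4 assignments to examine
full_check = equal_on_all_inputs(s1, s2, 4)     # 16 assignments to examine
```

The reduced check examines 2^m assignments instead of 2^n. The implication only runs one way: if `F1` and `F2` differ, nothing can be concluded about `s1` and `s2`.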

2.2 Potential Equivalence Finder (PEF) 

The potential-equivalence finding process produces a list, PEF-list, of signal 
pairs. Every pair of signals s and s’ which carry identical functions should be 
on the list, and ideally, we would like to include only pairs of equivalent 
signals from the circuits. Of course, this is as hard as finding all equivalent 
signals, so we use an efficient “signal signature” to eliminate pairs which can 
easily be shown 

Min-cut Equivalence Algorithm 

INPUT: 2 circuits and a list of corresponding inputs 
0: Initialize EQ-list to pairs of corresponding signals 
1: Call PEF to build PEF-list /* see Section 2.2 */ 
2: Do while (PEF-list is not empty) 
   2.1: Take the first pair (s,s’) from PEF-list 
        /* Use the base checker, BC, if possible */ 
   2.2: If s and s’ are “simple enough”, call BC(s,s’) 
        If s = s’, put (s,s’) on EQ-list 
        /* If BC cannot work, call CDP (Section 2.4) */ 
   2.3: If s and s’ are not “simple enough”, call CDP(s,s’) 
   Od; 

Figure 1. Min-cut Equivalence Algorithm. 



to be non-equivalent. This reduces the number of non-equivalent pairs which 
will later be tested for equivalence. The method we use computes signatures by 
generating a small number (e.g. 512) of consistent random assignments to the 
initial equivalences and simulating the circuits for these values. This produces 
a vector of values at each signal which we use as the signal’s signature. Signa- 
tures are hashed and all pairs with the same signature are considered potential 
equivalences and are placed on PEF-list. Finally, PEF orders the elements of 
the PEF-list in breadth-first order beginning with inputs. 
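
The signature step can be sketched as follows. This is a minimal sketch, not the implementation described here: the netlist encoding (a dict mapping each signal to a gate function and its fan-in, listed in topological order), the gate names and the example circuits are all assumptions made for illustration. Corresponding inputs share one name, which makes every random assignment consistent by construction.

```python
import random
from collections import defaultdict

def potential_equivalences(reference, comparison, inputs, num_vectors=512, seed=0):
    """Pair signals of two netlists whose random-simulation signatures agree.

    Each netlist maps signal name -> (gate function, fan-in names) and must be
    listed in topological order. Signal names are assumed to be distinct
    between the two netlists; input names are shared.
    """
    rng = random.Random(seed)
    merged = {**reference, **comparison}
    signature = {name: [] for name in merged}
    for _ in range(num_vectors):
        values = {name: rng.getrandbits(1) for name in inputs}
        for name, (fn, fanin) in merged.items():
            values[name] = fn(*(values[f] for f in fanin))
            signature[name].append(values[name])
    # Hash the signatures: one bucket per distinct vector of simulated values.
    buckets = defaultdict(lambda: ([], []))
    for name in reference:
        buckets[tuple(signature[name])][0].append(name)
    for name in comparison:
        buckets[tuple(signature[name])][1].append(name)
    return [(s, t) for refs, cmps in buckets.values() for s in refs for t in cmps]

# Hypothetical gate library and circuits.
AND = lambda a, b: a & b
NOR = lambda a, b: 1 - (a | b)
NOT = lambda a: 1 - a
XOR = lambda a, b: a ^ b

reference = {"n0": (AND, ("a", "b")), "s0": (XOR, ("n0", "c"))}
comparison = {
    "m0'": (NOT, ("a",)),
    "m1'": (NOT, ("b",)),
    "n0'": (NOR, ("m0'", "m1'")),   # equals a AND b by De Morgan
    "s0'": (XOR, ("n0'", "c")),
}
pef_list = potential_equivalences(reference, comparison, inputs=("a", "b", "c"))
```

Signals with different signatures are certainly not equivalent, while signals with equal signatures are only candidates. The sketch returns pairs in the order the reference signals were listed; the breadth-first ordering described above would be a separate step.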

2.3 Base Checker (BC) 

Our decomposition method will work with any equivalence checker. In our 
experimental implementation, we used an exhaustive simulation checker which 
could check circuits with up to 30 inputs. 

2.4 Circuit Decomposition Process (CDP) 

The circuit decomposition process performs the bulk of the work. CDP is an 
iterative process which tries to add the elements of PEF-list to EQ-list. At each 
stage, CDP does the following: 

1 Takes the next pair (s,s’) from PEF-list. These are called the current 
signals. 

2 If s and s’ depend on fewer than 10 inputs (any small number could be 
used), BC is used to check whether they are equivalent. 

3 If s and s’ depend on more than 10 inputs, derive a simultaneous 
decomposition $(F_1, F_2, g)$ for s and s’. (The technique used to derive the 
decomposition is described below.) If size$(F_1, F_2, g) < 30$, use BC to determine 
whether $F_1 = F_2$. If so, add (s,s’) to EQ-list. 

Since PEF-list has been ordered in a breadth-first manner, when a pair is 
checked, “earlier pairs” have already been checked for equivalence and are used 
to derive the decomposition. If CDP adds an output signal pair to the list, we 
have shown that the outputs are equivalent. 

We will now describe how the simultaneous decomposition of (s,s’) is de- 
rived. First, for signal s, a directed acyclic graph is formed. The nodes of the 
graph include a source node, a sink node, a node for signal s and one node, 
node(t), for every signal, t, which is on EQ-list and from which there is a path 
to s. The edges include one from the source to each of the initial equivalence 
points, one from s to the sink and one from node(sig1) to node(sig2) if there 
is a path from sig1 to sig2 in the circuit which avoids all signals currently on 
EQ-list. A similar dag is built for signal s’, and the source, sink and 
corresponding equivalent nodes from the two graphs are merged, except for nodes 
corresponding to signals known to be equivalent to the current signals. (That 
is, if (t,v) is on EQ-list and neither t nor v is known to be equivalent to either of 
the current signals, then node(t) and node(v) may be merged.) Finally, all nodes 
in the combined graphs which correspond to equivalence points not including 
nodes corresponding to signals known to be equivalent to the current signals 
are split into an in-half and an out-half. All edges which were incident to the 
original unsplit node are made incident to the in-half, all edges which began at 
the unsplit node are moved so they begin at the out-half, and an edge is added 
from the in-half to the out-half of each split node. All edges are given infinite 
weight except the edges from the in-half to the out-half of the split nodes. These 
edges are given weight 1. 

The resulting graph represents how the equivalence points in the two cones of 
logic that feed s and s’ interact to influence these signals. The min-cut algorithm 
is now applied to this graph. Observe that a cut separating the sink and source 
nodes corresponds to a set of equivalences which determine both s and s’, and 
therefore to a simultaneous decomposition, $(F_1, F_2, g)$, of s and s’. Because of 
the properties of the min-cut, this is, in some sense, the smallest simultaneous 
decomposition of s and s’. 
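
The node-splitting construction can be sketched with a small max-flow routine. This is a sketch under stated assumptions and not the implementation described here: it uses plain breadth-first augmenting paths, the function name `min_vertex_cut` is hypothetical, and the example graph is an assumed topology chosen for illustration rather than the graph of any figure. Nodes in `unit_nodes` are split into an in-half and an out-half joined by an edge of weight 1; all other edges have infinite weight.

```python
from collections import defaultdict, deque

INF = float("inf")

def min_vertex_cut(edges, unit_nodes, source, sink):
    """Return the set of unit-weight nodes forming a minimum source/sink cut."""
    cap = defaultdict(lambda: defaultdict(int))  # residual capacities

    def node_in(v):
        return (v, "in") if v in unit_nodes else (v, "both")

    def node_out(v):
        return (v, "out") if v in unit_nodes else (v, "both")

    for v in unit_nodes:                      # in-half -> out-half, weight 1
        cap[(v, "in")][(v, "out")] = 1
        cap[(v, "out")][(v, "in")] += 0       # make the reverse arc visible
    for u, v in edges:                        # original edges, infinite weight
        cap[node_out(u)][node_in(v)] = INF
        cap[node_in(v)][node_out(u)] += 0

    s, t = node_out(source), node_in(sink)

    def reachable_from_source():
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_from_source()
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][w] for u, w in path)
        if bottleneck == INF:
            raise ValueError("no finite cut separates source and sink")
        for u, w in path:
            cap[u][w] -= bottleneck
            cap[w][u] += bottleneck

    # A split node is in the cut when its in-half is reachable in the final
    # residual graph but its out-half is not.
    return {v for v in unit_nodes
            if (v, "in") in parent and (v, "out") not in parent}

# Hypothetical merged graph: three initial equivalences feed the current signal.
edges = [("source", "C0"), ("source", "C1"), ("source", "C2"),
         ("C0", "S0"), ("C1", "S0"), ("S0", "S4"), ("C2", "S4"),
         ("S4", "sink")]
cut = min_vertex_cut(edges, {"C0", "C1", "C2", "S0"}, "source", "sink")
```

In this assumed graph the cut has two nodes, so the base checker would face a function of 2 inputs rather than 3. The current signal `S4` is left unsplit, which gives it infinite weight and keeps it out of any finite cut.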

3. Example and Experimental Results 
3.1 Example of the Algorithm 

We will illustrate the algorithm using the circuits shown in Figure 2. For the 
sake of illustration, assume that our base checker can handle functions of at most 
2 inputs. 

Figure 2. An example of the Comparison Algorithm. 



We are given that C0 is equivalent to C0’, C1 to C1’, and C2 to C2’, so we 
begin by initializing 

EQ-list = ((C0,C0’), (C1,C1’), (C2,C2’)) 

Random simulation determines 

PEF-list = ((C0,C0’), (C1,C1’), (C2,C2’), (N0,N0’), (N1,N1’), (S0,S0’), (S2,S4’), (S4,S4’)). 

The program begins at (N0,N0’) and finds that N0 and N0’, N1 and N1’, and 
S0 and S0’ are equivalent using the base checker. The later potential equivalence 
pairs (S2,S4’) and (S4,S4’) both depend on 3 inputs and, since we are assuming 
our base checker is limited to 2 inputs, we must use the CDP process. The 
processing of both pairs is similar, and so we will use (S4, S4’) for illustration 
since it more fully exercises our algorithm. We will therefore assume that we 
already verified that (S2, S4’) is an actual equivalence and that we are now 
processing the potential equivalence (S4, S4’). 

The signals in the cone of S4 which are currently known to be equivalence 
points are C0, C1, N0, N1, S0, C2 and S2. The signals in the cone of S4’ 
which are currently known to be equivalence points are C0’, C1’, C2’, N0’, N1’ 
and SO’. First, a graph is built using signals, t, on EQ-list from which S4 can 
be reached. A similar graph is made for the comparison model, and they are 
merged together to yield the graph shown in Figure 2. To simplify the diagram, 
instead of splitting nodes corresponding to equivalence points, we have labeled 
the node with the weight which would be assigned to the inserted edge. Nodes 
which cannot be split are labeled infinity. Note that the node corresponding to 
S2 is neither merged nor split because it is known to be equivalent to the current 
signals. Applying the min-cut algorithm results in a cut consisting of the nodes 
representing the S0/S0’ and C2/C2’ equivalence sets. This is indicated by the 
box in Figure 2. The min-cut has reduced the size of the input set that must 
be handled from 3 to 2. The program now finds that S4 and S4’ are equivalent 
using the base checker, treating S0/S0’ and C2/C2’ as independent inputs. 
Applying our main theorem permits us to conclude that S4 and S4’ are equivalent 
as functions of the initial equivalences. 

3.2 Experimental Results 

This procedure was implemented in PL/I as part of the Logic Synthesis Sys- 
tem [5]. As BC, we used an exhaustive simulation equivalence checker limited 
to 30 inputs, as described in 2.3. Our test cases included three pieces of IBM 
logic as well as the ISCAS test set. In all cases, the original specification was 
compared to the implementation obtained by running LSS with a target of un- 
restricted NAND gates. The algorithm works best on circuit pairs which have 
a great deal of structural similarity, and it would be interesting to evaluate the 
method by comparing implementations designed using different methodologies. 
We observe, however, that in the context of industrial design, the experiments 
described here are typical examples of the tasks an equivalence algorithm might 
be called upon to perform. 



Model     Equiv. Output    Max Cone    Max Cutset    Pairs 
Name      Signals          S/T         Size 

Sim 1     66/66            140         19            - 
Sim 2     463/477          100         55            - 
Sim 3     145/186          97          24            - 
C432      2/7              27/36       27            85 
C499      32/32            41/41       13            288 
C880      26/26            45/45       12            157 
C1355     32/32            41/41       10            754 
C1908     22/25            33/33       11            1192 
C2670     50/140           122         48            - 
C5315     122/123          67/67       13            1188 

Table 1. Functional Comparison Results (explained in Section 3). 

The results are summarized in Table 1. Column 1 identifies the design that 
was run. Column 2 shows the total number of outputs and how many of them 
were proved equivalent. Column 3 shows the largest number of inputs in the 
signature of any output (T) and of any successfully compared output (S). (Where 
only one figure is given, the number is the number of inputs in the cone of the 
largest successfully compared output.) Column 4 shows the largest cut set in any 
successfully compared output. Column 5 shows the number of pairs for which 
a decomposition was used by CDP to show equivalence of two signals. 

We note that in most cases, even though the two designs were structurally 
very different, the program was able to successfully decompose the logic and 
establish equivalences which depend on more than 30 inputs and which would 
have been impossible for the unaided base checker to establish. Note that in 
approximately half of the cases our method allowed the base checker to establish 
complete equivalence between the two designs, and in all but one of the rest, it 
resulted in significant reductions in the number of inputs which had to be treated 
simultaneously. 

4. Avoiding False Negatives 

As mentioned earlier, it is possible for the min-cut procedure to select a de- 
composition that causes two signals which are actually equal to appear to differ. 
In this section, we describe a method which may help avoid these “false neg- 
atives”. This method, which was developed in conversation with R.E. Bryant, 
is as yet unimplemented. We will, however, explain why we think it may be 
effective in addressing this problem. 

In this section, we assume that we are using the reduced function graph 
method [3] as the base checker, and that the reader is familiar with this rep- 
resentation. Assume that we have processed part of the two circuits as described 
above, that we have established a number of internal equivalences, and that we 
have represented each of the equivalences as reduced function graphs whose 
variables are “earlier” internal equivalences and/or initial equivalences. This re- 
quires a total order on both initial equivalences and internal signals of the two 
circuits. Such an order can be computed using techniques described in [2]. Or- 
dering initial equivalences is discussed in [8, 9]. 

Assume that (s,s’) is a potential equivalence and that s and s’ have simultaneous 
decomposition $(F_1, F_2, g)$ and that $F_1 \neq F_2$. We wish to determine whether 
$F_1 \circ g = F_2 \circ g$. In addition, for efficiency, we want to do this in a way which 
avoids representing the entire logic graphs as functions of the initial equiva- 
lences. 

The algorithm we suggest has two parts. First, form the reduced function 
graph of $H = F_1 \oplus F_2$. Second, until no internal equivalences appear as variables 
in H, choose an internal equivalence which occurs as a variable in H, and expand 
this equivalence in terms of earlier equivalences. During step 2 of the algorithm, 
either the resulting function becomes 0, or all internal equivalences have been 
expanded and the result is an expression for the exclusive-or as a function of the 
real inputs, or the required functions grow too large and can no longer be repre- 
sented as reduced function graphs. In the first case, we have shown equivalence, 
and in the second case, we have shown true non-equivalence. In the final case, 
we can draw no conclusion about the two functions. 

If s and s’ are actually equivalent, the dag representation of $F_1 \oplus F_2$ in terms 
of the internal equivalences may be comparatively small. As the equivalences 
are expanded and more information about the relationships among them is added, 
the exclusive-or must eventually assume its actual value which, if s = s’, is zero. 
These observations suggest that we may be able to compute with the exclusive-or 
in this way, even in cases where we could not represent either s or s’ directly 
in terms of the initial equivalences. There is, of course, no guarantee that the 
BDDs will remain small; however, we feel it is an interesting approach to the 
problem of false negatives in a “decomposition” equivalence checker. 
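
A false negative, and its resolution by expansion, can be shown on a toy case. The sketch below is an illustration only: it uses explicit enumeration instead of reduced function graphs, and the functions `g`, `F1` and `F2` are hypothetical. The cut signals produced by `g` can never take the value (1, 0), so two functions that differ only on that assignment still yield equivalent signals.

```python
from itertools import product

# Hypothetical internal equivalences u = a AND b, v = a OR b.
# The combination (u, v) = (1, 0) is never produced.
def g(a, b):
    return (a & b, a | b)

def F1(u, v):
    return u | v

def F2(u, v):
    return v

def H(u, v):
    # Exclusive-or of the two functions over the cut variables.
    return F1(u, v) ^ F2(u, v)

# Treating u and v as independent inputs, H is not identically 0 ...
differs_on_cut = any(H(u, v) for u, v in product((0, 1), repeat=2))

# ... but after expanding u and v in terms of the real inputs, H becomes 0.
zero_after_expansion = not any(H(*g(a, b)) for a, b in product((0, 1), repeat=2))
```

Here `differs_on_cut` corresponds to the base checker reporting a difference, and `zero_after_expansion` corresponds to the first outcome of step 2 above, where the exclusive-or collapses to 0.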

5. Summary 

We have described an algorithm for testing functional equivalence between 
two logic designs. Our method can augment the power of any given equiva- 
lence checker by automatically reducing the equivalence problem to a number 
of small related problems. Our experimental implementation demonstrated that 
our method could verify equivalence of circuits whose equivalence could not 
have been verified by the unaugmented algorithm. 

The primary technical contribution is a technique for discovering internal 
equivalences and using them to show the equivalence of the outputs. Our method 
involves the use of signatures to reduce the number of potentially equivalent 
signals, and the use of the min-cut algorithm to reduce the original problem 
to related problems with fewer independent inputs. This algorithm has been 
implemented and has been shown to be effective on sample logic. The test- 
ing technique used in the implementation was exhaustive simulation, but other 
more efficient techniques could be used to increase the power and improve the 
run time of this procedure. 

One danger with “decomposition” equivalence checkers is that an inconsis- 
tent test case may give a false negative. We have outlined an heuristic that could 
be employed to ameliorate this problem. 

1. Introduction 

Hardware description languages (HDLs) dramatically change the way circuit 
designers work. These languages can be used to describe circuits at a very high 
level of abstraction, which allows the designers to specify the behavior of a 
circuit before realizing it. The validation of these specifications is currently done 
by executing them, which is very costly [2]. This cost motivates the research [3, 
5, 7, 10] done on the automatic verification of temporal properties of finite state 
machines. 

Once the design of the circuit is done, the problem is to verify that the result- 
ing circuit is correct with respect to its specification. Until recently, this verifi- 
cation was done by simulating the circuit and its specification on the same input 
sequences and by comparing their output sequences. The verification method is 
very costly and incomplete because of the large number of input sequences to 
consider [2]. 

This paper presents a unified framework for the verification of synchronous 
circuits. Within this framework the two verification tasks presented above can be 
automatically performed using algorithms based on the same concepts. The first 
idea is to manipulate sets of states and sets of transitions instead of individual 
states and individual transitions. The second idea is to represent these sets by 
Boolean functions and to replace operations on sets with operations on Boolean 
functions. 

Part 2 of the paper defines the two problems addressed here, and then it 
presents the verification algorithms. It shows that these algorithms use the 
standard set operations in addition to two specific operations called “Pre” and “Img”. 
Part 3 briefly explains why the basic set operations are very efficiently performed 
when sets are denoted by the Typed Decision Graphs of their characteristic 
functions. Part 4 presents the new Boolean operators “Constrain” and “Restrict”, and 
the function “Expand” that efficiently support the “Img” and “Pre” operations. 
Part 5 gives experimental results and discusses them. 

2. The Verification Algorithms 

This section defines the model of sequential circuits that will be verified, and 
the two verification problems addressed here. Then it gives for both problems 
an algorithm based on set manipulations. 

2.1 Definitions 

For the sake of clarity we will consider in this paper that the sequential cir- 
cuits that must be verified are deterministic Moore machines. Dealing with 
Mealy machines is done in a similar way. Moreover we assume that these ma- 
chines are completely specified, which means that for any state of the machine, 
(1) the outputs are defined, and (2) for any input pattern, the next state of the 
machine is defined. This is not a limitation since, if the machine is incompletely 
specified, it is possible to add a dummy state in order to obtain a completely 
specified machine. 

A deterministic Moore machine $M$ is defined by a 6-tuple $(Y, I, O, \delta, \lambda, Init)$. 
$Y$ is the vector $[y_1, \ldots, y_m]$ of Boolean state variables of the machine: a state of 
$M$ is defined by the Boolean values of the variables $y_1, \ldots, y_m$. $I$ is the vector 
of $n$ Boolean inputs of the machine. $O$ is the vector of $k$ Boolean outputs of the 
machine. The output function $\lambda$ is a vector of $k$ Boolean functions (one for each 
output) from the set $\{0,1\}^m$ into $\{0,1\}$. The transition function $\delta$ is a vector of 
$m$ Boolean functions from $\{0,1\}^m \times \{0,1\}^n$ into $\{0,1\}$. Finally $Init$ is the initial 
state of the machine. 

The 6-tuple that defines a sequential circuit can be obtained either from its 
gate level description or from its functional description, using a symbolic exe- 
cution process such as the one used in PRIAM [2]. 

2.2 Verification of Temporal Properties 

The temporal formulas that the verification system takes as input are the state 
formulas of the computation tree logic CTL [7]. This logic is a formalism that 
was specifically developed to express properties of the states and the computa- 
tion paths of finite state systems. The meaning of a state formula is relative to a 
state of the machine, which is here defined by the values of its state variables. The 
4 basic kinds of CTL state formulas are the following: 

(1) $(y_1), \ldots, (y_m)$ are state formulas. For any state $s$, $s \models y_j$ if and only if (iff) 
the value of the variable $y_j$ is 1 in the state $s$. 

(2) If $f$ and $g$ are state formulas then so are the formulas $(\neg f)$, $(f \wedge g)$, $(f \vee g)$, 
$(f \Leftrightarrow g)$, and $(f \Rightarrow g)$. 

(3) If $f$ is a state formula, so are the formulas $EX(f)$ and $AX(f)$: $s \models EX(f)$ 
iff there exists at least one input pattern $p$ such that $\delta(s,p) \models f$, and 
$AX(f) =_{def} \neg EX(\neg f)$. 

(4) If $f$ and $g$ are state formulas, so are the formulas $E[f\ U\ g]$ and $A[f\ U\ g]$: 
$s \models E[f\ U\ g]$ iff there exists at least one path $(s_0, s_1, \ldots)$ with $s_0 = s$, such 
that $\exists i((s_i \models g) \wedge (\forall j(0 \le j < i \Rightarrow s_j \models f)))$, and $s \models A[f\ U\ g]$ iff for all 
paths $(s_0, s_1, \ldots)$ such that $s_0 = s$, 
$\exists i((s_i \models g) \wedge (\forall j(0 \le j < i \Rightarrow s_j \models f)))$. 

The first algorithms that have been proposed to verify automatically that some 
machine satisfies a temporal property [7] used traversal techniques of its 
state-transition graph, which had to be partially or entirely built. This limited the 
application of these algorithms to relatively small machines. 

The verification algorithm used here takes as inputs a machine 
$M = (Y, I, O, \delta, \lambda, Init)$ and the temporal formula $f$ to be verified. It recursively 
computes the set of states of $M$ that satisfy the formula $f$ from the sets of states that 
satisfy its subformulas. At each step there are only 4 basic cases to consider that 
correspond to the 4 basic kinds of formulas given above. Once the set of states 
$F$ that satisfy the whole formula is obtained, checking whether $(s \models f)$ for some 
state $s$ of $M$ comes down to checking whether $s$ belongs to $F$ [3]. 

The sets of states that satisfy formulas of type (1) and (2) can be computed 
using the basic set operations. For instance, the set of states that satisfy the 
formula $(y_1)$ is $\{1\} \times \{0,1\}^{m-1}$; if $F$ and $G$ are the sets of states that satisfy 
the formulas $f$ and $g$ respectively, then the set of states that satisfy the formula 
$(f \vee g)$ is $(F \cup G)$. 

The other kinds of formulas are treated with the “Pre” operation, either in one 
step (EX and AX formulas) or by fixed point algorithms (EU and AU formulas). 
By definition, 

$Pre(Q,A,B) = \{s \mid (s \in A) \wedge (Q p\ \delta(s,p) \in B)\}$, 

where $Q$ is either the existential “$\exists$” or the universal “$\forall$” quantifier. $Pre(Q,A,B)$ 
is the subset of states of $A$ which have either at least one successor ($Q = \exists$) or 
all their successors ($Q = \forall$) in the set $B$. Let $f$ be a formula and $F$ be the set of 
states that satisfy $f$. The sets of states $EX$ and $AX$ that satisfy $EX(f)$ and $AX(f)$ 
respectively are defined by: 

$EX = Pre(\exists, \{0,1\}^m, F)$    (1) 

$AX = Pre(\forall, \{0,1\}^m, F)$    (2) 

Let $f$ and $g$ be two formulas and $F$ and $G$ be the sets of states that satisfy $f$ and 
$g$ respectively. The sets of states $EU$ and $AU$ that satisfy the formulas $E[f\ U\ g]$ 
and $A[f\ U\ g]$ respectively are the limits of the following converging sequences 
of sets $(E_k)$ and $(A_k)$ [3]: 

$E_0 = G$, and $E_{k+1} = E_k \cup Pre(\exists, F, E_k)$,    (3) 

$A_0 = G$, and $A_{k+1} = A_k \cup Pre(\forall, F, A_k)$.    (4) 

These algorithms use only the basic set operations $\cup$, $\cap$, $=$, in addition to the 
“Pre” operation. 
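
The four constructions can be sketched with explicit sets of states. This is a sketch under stated assumptions and not the symbolic implementation discussed in this paper: states are enumerated one by one rather than represented by Boolean functions, and the machine (a saturating 2-bit counter with an enable input) is hypothetical.

```python
from itertools import product

M_BITS, N_BITS = 2, 1
STATES = set(product((0, 1), repeat=M_BITS))
INPUTS = list(product((0, 1), repeat=N_BITS))

def delta(s, p):
    """Hypothetical transition function: count up when p = 1, saturate at 3."""
    value = 2 * s[0] + s[1]
    if p[0] and value < 3:
        value += 1
    return (value >> 1, value & 1)

def pre(quantifier, A, B):
    """States of A with some ("exists") or all ("forall") successors in B."""
    test = any if quantifier == "exists" else all
    return {s for s in A if test(delta(s, p) in B for p in INPUTS)}

def EX(F):
    return pre("exists", STATES, F)

def AX(F):
    return pre("forall", STATES, F)

def until(quantifier, F, G):
    """Limit of X_0 = G, X_{k+1} = X_k | Pre(Q, F, X_k)."""
    current = set(G)
    while True:
        following = current | pre(quantifier, F, current)
        if following == current:
            return current
        current = following

def EU(F, G):
    return until("exists", F, G)

def AU(F, G):
    return until("forall", F, G)

TOP = {(1, 1)}
can_reach_top = EU(STATES, TOP)    # E[True U top]
must_reach_top = AU(STATES, TOP)   # A[True U top]
```

Every state can reach the saturated state along some path, but a path that never asserts the enable input stays where it is, so only the saturated state itself satisfies the universal formula.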

2.3 Comparison of Sequential Machines 

The fundamental method for comparing the observable behaviors of 
$M_1 = (Y_1, I, O, \delta_1, \lambda_1, Init_1)$ and $M_2 = (Y_2, I, O, \delta_2, \lambda_2, Init_2)$ is to check that 
the output $ok$ of the product machine $M = M_1 \times M_2$ is equal to 1 for every valid 
state of $M$. This machine is defined by $M = (Y, I, [ok], \delta, \lambda, Init)$ where 
$Y = Y_1 @ Y_2$ (@ is the vector concatenation), $\delta = \delta_1 @ \delta_2$, $Init = Init_1 \times Init_2$ 
and $\lambda_{ok}$ depends on the comparison. 

The correctness property given above holds for the machine $M$ iff the CTL 
state formula $(\neg E[True\ U\ (ok = 0)])$ holds in its initial state $Init$, so the 
verification algorithm presented in 2.2 can be used to compare two machines. We 
give here a specific comparison algorithm that has been shown by experience 
to be much more efficient than this general algorithm. Both methods will be 
discussed in Part 5. 

The idea used here is to compute the set Valid of all the valid states of the 
machine $M$. This set is the limit of the converging sequence of sets $V_k$ defined 
by the equations: 

$V_0 = Init$, and $V_{k+1} = V_k \cup Img(\delta, V_k \times \{0,1\}^n)$,    (5) 

where $Img(f,A) =_{def} \{f(a) \mid a \in A\}$ is the image of the set $A$ with respect to 
the function $f$. Once $Valid$ is computed, the verification comes down to testing 
whether 

$Img(\lambda_{ok}, Valid) = \{1\}$.    (6) 

This algorithm, like the one given in the previous section, uses only the basic set 
operations, in addition to the “Img” operation. 
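
The comparison procedure can be sketched on a toy product machine. This is a sketch under stated assumptions: it enumerates states explicitly instead of manipulating characteristic functions, and both machines (a 1-bit toggle and a one-hot encoded version of it) are hypothetical examples.

```python
INPUTS = [(0,), (1,)]

# Hypothetical machine 1: one state bit that toggles when the input is 1.
def delta1(s, p):
    return (s[0] ^ p[0],)

def out1(s):
    return s[0]

# Hypothetical machine 2: the same behavior with a one-hot state encoding.
def delta2(s, p):
    return (s[1], s[0]) if p[0] else s

def out2(s):
    return s[1]

init = ((0,), (1, 0))  # initial state of the product machine

def delta(s, p):
    s1, s2 = s
    return (delta1(s1, p), delta2(s2, p))

def ok(s):
    s1, s2 = s
    return int(out1(s1) == out2(s2))

def image(fn, states):
    """Img(fn, states x inputs): all successors of the given states."""
    return {fn(s, p) for s in states for p in INPUTS}

def valid_states(initial):
    """Limit of V_0 = {initial}, V_{k+1} = V_k | Img(delta, V_k x inputs)."""
    valid = {initial}
    while True:
        following = valid | image(delta, valid)
        if following == valid:
            return valid
        valid = following

valid = valid_states(init)
machines_equivalent = {ok(s) for s in valid} == {1}
```

The product machine has 8 states but only 2 of them are valid. Some invalid states, such as ((0,), (0, 1)), give ok = 0, which is why the test is made on the image of Valid and not on the whole state space.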

3. Boolean Functions and Sets 

Any subset $A$ of $\{0,1\}^n$ can be represented by a unique Boolean function $\chi_A$ 
from $\{0,1\}^n$ to $\{0,1\}$, defined by: $\chi_A(a) = 1$ if and only if $a \in A$. The function 
$\chi_A$ is called the characteristic function of $A$. The set operators ($\cup, \cap, \subseteq, \in, \times$) 
can be expressed in terms of the logical operators ($\vee, \wedge, \neg, \Rightarrow$). For instance, 
the characteristic function of the set $A \cup B$ is $(\chi_A(y) \vee \chi_B(y))$. 
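
The correspondence between set operators and logical operators can be checked on a small case. In this sketch, which is only an illustration, a characteristic function over n = 3 variables is packed into an integer truth table with one bit per element; this encoding and the helper name `chi` are assumptions made for the example and are not the graph representation discussed next.

```python
from itertools import product

N = 3
ALL = set(product((0, 1), repeat=N))
FULL = (1 << len(ALL)) - 1  # truth table of the constant function 1

def chi(subset):
    """Characteristic function of `subset`, packed as an integer truth table."""
    mask = 0
    for element in subset:
        index = int("".join(str(bit) for bit in element), 2)
        mask |= 1 << index
    return mask

A = {(0, 0, 1), (1, 1, 0)}
B = {(1, 1, 0), (1, 1, 1)}

union_is_or = chi(A | B) == (chi(A) | chi(B))
intersection_is_and = chi(A & B) == (chi(A) & chi(B))
complement_is_not = chi(ALL - A) == (FULL & ~chi(A))
# A is a subset of B iff chi_A implies chi_B, i.e. chi_A AND NOT chi_B is 0.
subset_is_implication = (A <= B) == ((chi(A) & ~chi(B)) == 0)
```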

Typed decision graphs [1] are a compact canonical representation of Boolean 
functions. They have remarkable properties that make the symbolic manipulations 
on Boolean functions very efficient. Typed decision graphs, which are 
binary decision diagrams [4] with typed edges, are the canonical graph 
representation associated with Shannon’s typed canonical form [1]. By associating 
a unique atom to each component of the cartesian product $\{0,1\}^n$, any 
characteristic function can be represented by a unique typed decision graph. 

The correspondence mentioned above gives the computational cost of the 
elementary set operations. The negation on typed decision graphs has a null cost, 
so the same holds for set complementation. The other Boolean operations 
have a complexity in $O(|G_1| \times |G_2|)$, where $|G|$ is the number of vertices in the 
graph $G$, which gives the computational cost of the corresponding elementary 
set operations. 

There is no relation between the number of elements in a set and the number 
of vertices in the graph of its characteristic function. However, there exist some 
subsets of $\{0,1\}^n$ whose graphs have $O(2^n/\log(n))$ vertices. Experience shows 
that, for most of the machines we deal with, while the sets manipulated by the 
verification algorithms are very large, the graphs of their characteristic functions 
stay small. This means that the computational cost of the basic set operations 
performed by the symbolic verification algorithms is low, and that the total cost 
of these algorithms depends on the costs of the operations “Pre” and “Img”. 

4. The “Pre” and “Img” Operations 

This part explains how the operations “Pre” and “Img” can be easily realized 
when the typed decision graphs of the transition relation and of the output 
relation of the machine can be built [5]. Then it presents the techniques we 
have developed to perform these operations when it is not possible to build these 
graphs, which happens for most complex circuits. 

4.1 Using the transition and the output relations 

The transition relation $\Delta$ of the machine $M$ is a subset of 
$\{0,1\}^m \times \{0,1\}^n \times \{0,1\}^m$. For any states $s$ and $s'$ of $M$ and for any input 
pattern $p$, $(s, p, s')$ belongs to $\Delta$ if and only if $s' = \delta(s,p)$. For any subsets $A$ 
and $B$ of $\{0,1\}^m$, the characteristic function of the set $Pre(Q,A,B)$ is equal to: 

$\chi_{Pre(Q,A,B)}(s) = \chi_A(s) \wedge (Q p\ \exists s'\ \chi_B(s') \wedge \chi_\Delta(s,p,s'))$. 



The output relation of the machine, noted $\Lambda$, is a subset of $\{0,1\}^m \times \{0,1\}^k$. 
For any state $s$ of $M$, and for any output pattern $o$, $(s, o)$ belongs to $\Lambda$ if and only 
if $o = \lambda(s)$. The equations 5 and 6, using the operation “Img”, can be written as: 

$\chi_{Img(\delta, A \times \{0,1\}^n)}(s') = \exists s\ \exists p\ \chi_A(s) \wedge \chi_\Delta(s,p,s')$    (7) 

$\chi_{Img(\lambda,A)}(o) = \exists s\ \chi_A(s) \wedge \chi_\Lambda(s,o)$    (8) 

Since the formula $(\exists x\, f(x))$ is equivalent to $(f(0) \vee f(1))$, and the formula 
$(\forall x\, f(x))$ is equivalent to $(f(0) \wedge f(1))$, the graphs of the characteristic 
functions of $Pre(Q,A,B)$, $Img(\delta, A \times \{0,1\}^n)$ and $Img(\lambda,A)$ can be directly 
computed from the graphs of $\chi_A$, $\chi_B$, $\chi_\Delta$, and $\chi_\Lambda$ by eliminating the quantified atoms 
associated with $p$, $s$ and $s'$. This technique is very efficient [5], because it uses 
only the operators $\vee$ and $\wedge$, which have been shown in Section 3 to have 
relatively low computational costs. The problem is that, for complex machines, it is 
not possible to build the graphs of $\Delta$ and $\Lambda$ [8]. 

4.2 The “Restrict” and “Expand” Operators 

The equations that define the “Pre” operation are the following: 

$\chi_{Pre(\exists,A,B)}(s) = \chi_A(s) \wedge (\exists p\ \chi_B(\delta(s,p)))$, and    (9) 

$\chi_{Pre(\forall,A,B)}(s) = \chi_A(s) \wedge \neg(\exists p\ \neg\chi_B(\delta(s,p)))$.    (10) 

The graph of the function $(\exists p\ \chi(\delta(s,p)))$, where $\chi$ is $\chi_B$ or $\neg\chi_B$, can be computed 
in two steps. The first step consists in computing the function $\chi \circ \delta$. Then, the 
quantified atoms associated to the input pattern $p$ are eliminated from the graph 
of this function. 

It has been shown [4] that substituting some variable $v$ in the graph $G_1$ with 
the graph $G_2$ has a computational cost in $O(|G_1|^2 \times |G_2|)$. In order to compute 
the graph $\chi(\delta_1(s,p), \ldots, \delta_m(s,p))$, this basic substitution process must be 
iterated so that all variables of $\chi$ are substituted by the graphs of the corresponding 
components of $\delta$. The problem is that during this composition, some 
intermediate graphs can be too large to be built. 

Section 4.2.1 shows that it is not necessary to build the graph of $\chi \circ \delta$ to 
compute $Pre(Q,A,B)$. It presents the function “Expand” that avoids this 
construction. Section 4.2.2 presents the “Restrict” operator that further reduces the 
computational cost of the “Pre” operation by reducing the sizes of the graphs 
that are manipulated. 

4.2.1 The function “Expand”. The idea that underlies the function 
“Expand” is to express the function $\chi \circ \delta$ as a sum of $K$ functions $h_1, \ldots, h_K$, 
whose graphs have fewer vertices than the graph of $\chi \circ \delta$. Using these functions, 
the term $(\exists p\ \chi(\delta(s,p)))$ can be rewritten into $(\bigvee_j (\exists p\ h_j(s,p)))$. This identity 
allows us to eliminate directly the quantified atoms associated to the input 
pattern $p$ from the graphs of the functions $h_1, \ldots, h_K$, and so the graph of $\chi \circ \delta$ does 
not have to be built. 

Each path in the graph $G$ of the function $\chi$ starting from the root and leading 
to the leaf 1 defines a cube $c_j$ of the function $\chi$. This means that the function 
$c_j \circ \delta$ can be taken as one function $h_j$. The problems are that the function $\chi$ 
can have a very large number of cubes, and that even if its number of cubes is 
relatively small, many redundant computations will be made. 

The function Expand performs a top-down traversal of the graph $G$, and stores 
in each of its vertices the graph of the function $C_v \circ \delta$, where the function $C_v$ is 
the sum of all the cubes represented by the paths starting from the root of $G$ and 
leading to $v$. Each time the top-down traversal reaches the leaf 1, the function 
Expand produces one of the functions $h_j$. The function $C_v$ associated to a vertex 
$v$ is recursively computed using the functions $C_w$ of the vertices $w$ that point to $v$. 
Thanks to the sharing in the graph, partial results are factorized and redundant 
computations are avoided [10]. 
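
The rewriting that Expand relies on is the distribution of existential quantification over a sum. As a short worked step, with $K$ denoting the number of functions $h_j$ produced:

```latex
\exists p\; \chi(\delta(s,p))
  \;=\; \exists p\; \bigvee_{j=1}^{K} h_j(s,p)
  \;=\; \bigvee_{j=1}^{K} \bigl(\exists p\; h_j(s,p)\bigr)
```

The same step is not valid for a universal quantifier, since $\forall p$ does not distribute over a sum. This is consistent with the universal case of “Pre” being written with $\neg\exists p\,\neg$, so that only existential eliminations are needed.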

Experience shows that the graphs of the functions $h_1, \ldots, h_K$ generated by 
the Expand operation are smaller than the graph of $\chi \circ \delta$. The time needed to 
compute each of these functions directly depends on the sizes of the graphs of 
the functions $\delta_j$. The next section presents a Boolean operator that can be used 
to reduce the sizes of the graphs used in the term $\chi(\delta(s,p))$. 

4.2.2 The Operator “Restrict”. In the equations 9 and 10, whenever 
$\chi_A(s) = 0$ the characteristic function of $Pre(Q,A,B)$ is also equal to 0. This 
means that, in the term $\chi_B(\delta(s,p))$ that occurs in the equations 9 and 10, the 
transition function $\delta$ can be replaced with its restriction to the domain $A$. 

The “Restrict” operator, noted “$\Downarrow$”, takes as input the typed decision graphs 
of a Boolean function $f$ and of the characteristic function $c$ of the set to which 
the function $f$ must be restricted. The semantics of the Restrict operator [8] is 
given on Shannon’s canonical form in Figure 1. In this figure, $c.root$ is the root 
of Shannon’s canonical form of $c$, and $(c/\neg a, c/a)$ is Shannon’s expansion [1] of 
the function $c$ with respect to $a$. The main properties of the Restrict operator [9, 
10] are expressed by the following theorems. 



Theorem 1 For any Boolean functions f and $c \neq 0$, if $c(x) = 1$ then
$(f \Downarrow c)(x) = f(x)$.

Theorem 2 For any Boolean functions f and $c \neq 0$, Shannon’s typed canonical
form of $(f \Downarrow c)$ has at most the same number of vertices as that of f.



Theorem 2 is not true for typed decision graphs. It can happen that the graph of
$(f \Downarrow c)$ has more vertices than the graph of f. In this case the function restrict
returns the graph of f. Experience shows that this case occurs very rarely.

function restrict(f, c);
    if c = 0 then error;
    if c = 1 then return f;
    if f = 0 or f = 1 then return f;
    let a = c.root in {
        if c/¬a = 0 then return restrict(f/a, c/a);
        if c/a = 0 then return restrict(f/¬a, c/¬a);
        if f/¬a = f/a then return restrict(f, c/¬a ∨ c/a);
        return (¬a ∧ restrict(f/¬a, c/¬a)) ∨ (a ∧ restrict(f/a, c/a));
    }

Figure 1. The Restrict Operator on Shannon’s canonical form.
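
The recursion of Figure 1 can be tried out on small functions. The sketch below is only an illustration under stated assumptions: it replaces decision graphs by explicit truth tables over N = 3 variables (a canonical but exponential representation), and the helper names (`table`, `cofactor`, `root`, `ite`) are hypothetical, not taken from the paper.

```python
from itertools import product

N = 3                                        # number of variables (example size)
POINTS = list(product((0, 1), repeat=N))     # all input assignments
INDEX = {p: i for i, p in enumerate(POINTS)}

def table(fn):
    """Truth table of a Python predicate over N Boolean variables."""
    return tuple(int(bool(fn(*p))) for p in POINTS)

ZERO = table(lambda *x: 0)
ONE = table(lambda *x: 1)

def cofactor(f, a, val):
    """f with variable a fixed to val."""
    return tuple(f[INDEX[p[:a] + (val,) + p[a + 1:]]] for p in POINTS)

def root(c):
    """First variable (in index order) on which c depends."""
    return next(a for a in range(N) if cofactor(c, a, 0) != cofactor(c, a, 1))

def disj(f, g):
    return tuple(x | y for x, y in zip(f, g))

def ite(a, hi, lo):
    """(a AND hi) OR (NOT a AND lo)."""
    return tuple(h if p[a] else l for p, h, l in zip(POINTS, hi, lo))

def restrict(f, c):
    """Transcription of Figure 1 on truth tables."""
    if c == ZERO:
        raise ValueError("c must differ from 0")
    if c == ONE or f in (ZERO, ONE):
        return f
    a = root(c)
    c0, c1 = cofactor(c, a, 0), cofactor(c, a, 1)
    f0, f1 = cofactor(f, a, 0), cofactor(f, a, 1)
    if c0 == ZERO:
        return restrict(f1, c1)
    if c1 == ZERO:
        return restrict(f0, c0)
    if f0 == f1:                             # f does not depend on a
        return restrict(f, disj(c0, c1))
    return ite(a, restrict(f1, c1), restrict(f0, c0))

f = table(lambda x0, x1, x2: x0 ^ x1)
c = table(lambda x0, x1, x2: x0 and x2)
r = restrict(f, c)                           # equals the table of "not x1"
```

On the domain where c holds (x0 = 1 and x2 = 1), f reduces to the negation of x1, and the result agrees with f there, as Theorem 1 states.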



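4.3 Performing the Img Operation

The problem addressed here is to compute the characteristic function $\chi$ of the
image of the restriction of a vectorial Boolean function $F = [f_1 \cdots f_n]$ to a
domain defined by its characteristic function $\chi_A$. A definition of $\chi$ is [8]:

$$\chi(y) = \exists x \, \Bigl( \chi_A(x) \wedge \bigwedge_j \bigl( y_j \Leftrightarrow f_j(x) \bigr) \Bigr)$$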

The term $\bigwedge_j (y_j \Leftrightarrow f_j(x))$ represents the transition relation of the machine,
which has been shown to be in many cases too complex to be computed [8].
This section presents two algorithms that can be used to compute $\chi$ without
computing this term.

Both algorithms are based on the “constrain” operator, noted “$\downarrow$”, and work
in two steps [8]. The first step, common to both algorithms, consists in computing
a new vectorial function $F' = [f'_1 \cdots f'_n]$ such that $\mathrm{Img}(F, \chi_A) = \mathrm{Img}(F', 1)$. The
second step then consists in computing the characteristic function of $\mathrm{Img}(F', 1)$,
by using co-domain partitioning in the first algorithm and domain partitioning
in the second algorithm.

Figure 2 gives the semantics of the operator “$\downarrow$” [9] on Shannon’s canonical
form. The operator applies to Shannon’s canonical forms of the Boolean functions
f and c, and produces Shannon’s canonical form of the function $(f \downarrow c)$.
Its fundamental property is expressed by the following theorem [9].

Theorem 3 Let $F = [f_1 \cdots f_n]$ be a vector of functions, and c be a function
different from 0. Let $F \downarrow c =_{def} [(f_1 \downarrow c) \cdots (f_n \downarrow c)]$. Then $\mathrm{Img}(F \downarrow c, 1) =
\mathrm{Img}(F, c)$.
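
Theorem 3 can be checked by brute force on small functions. The following sketch makes the same simplifying assumption as a truth-table model would: functions are explicit truth tables over N = 3 variables rather than decision graphs, and all helper names are hypothetical. The constrain recursion follows the figure given for the operator; the case where f does not depend on a is folded into the general case, since both produce the same function.

```python
from itertools import product

N = 3
POINTS = list(product((0, 1), repeat=N))
INDEX = {p: i for i, p in enumerate(POINTS)}
ZERO, ONE = (0,) * len(POINTS), (1,) * len(POINTS)

def cofactor(f, a, val):
    """f with variable a fixed to val."""
    return tuple(f[INDEX[p[:a] + (val,) + p[a + 1:]]] for p in POINTS)

def root(c):
    """First variable (in index order) on which c depends."""
    return next(a for a in range(N) if cofactor(c, a, 0) != cofactor(c, a, 1))

def ite(a, hi, lo):
    """(a AND hi) OR (NOT a AND lo)."""
    return tuple(h if p[a] else l for p, h, l in zip(POINTS, hi, lo))

def constrain(f, c):
    """Constrain operator on truth tables."""
    if c == ZERO:
        raise ValueError("c must differ from 0")
    if c == ONE or f in (ZERO, ONE):
        return f
    a = root(c)
    c0, c1 = cofactor(c, a, 0), cofactor(c, a, 1)
    f0, f1 = cofactor(f, a, 0), cofactor(f, a, 1)
    if c0 == ZERO:
        return constrain(f1, c1)
    if c1 == ZERO:
        return constrain(f0, c0)
    return ite(a, constrain(f1, c1), constrain(f0, c0))

def image(F, c):
    """Output vectors F(x) over all x with c(x) = 1, by enumeration."""
    return {tuple(f[i] for f in F) for i in range(len(POINTS)) if c[i]}

# Example: F = [x0 XOR x1, x1 AND x2], domain c = x0 OR x2.
F = [tuple(p[0] ^ p[1] for p in POINTS), tuple(p[1] & p[2] for p in POINTS)]
c = tuple(p[0] | p[2] for p in POINTS)
lhs = image([constrain(f, c) for f in F], ONE)   # Img(F constrained by c, 1)
rhs = image(F, c)                                # Img(F, c)
```

Both sets are expected to coincide: the constrained functions, evaluated over the whole input space, produce exactly the vectors that F produces on the domain c.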



function cnst(f, c);
    if c = 0 then error;
    if c = 1 then return f;
    if f = 0 or f = 1 then return f;
    let a = c.root in {
        if c/¬a = 0 then return cnst(f/a, c/a);
        if c/a = 0 then return cnst(f/¬a, c/¬a);
        if f/¬a = f/a then return (¬a ∧ cnst(f, c/¬a)) ∨ (a ∧ cnst(f, c/a));
        return (¬a ∧ cnst(f/¬a, c/¬a)) ∨ (a ∧ cnst(f/a, c/a));
    }

Figure 2. The Constrain Operator on Shannon’s canonical form.

4.3.1 Co-domain Partitioning Based Algorithm. The first
recursive algorithm that computes $\mathrm{Img}(F, 1)$ uses the operator “$\downarrow$” to partition
the co-domain of the vectorial function F. The algorithm is a direct application
of the following theorem.

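Theorem 4 Let $F_n = [f_1 \cdots f_n]$ be a Boolean vectorial function. Then:
$\mathrm{Img}(F_n, 1) = \mathrm{Img}(F_{n-1} \downarrow \neg f_n, 1) \times \{0\} \ \cup\ \mathrm{Img}(F_{n-1} \downarrow f_n, 1) \times \{1\}$.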
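
A minimal sketch of the recursion suggested by Theorem 4 follows. It is an illustration only: functions are explicit truth tables over N = 3 variables instead of decision graphs, no cache or vector partitioning is used, and the helper names are hypothetical.

```python
from itertools import product

N = 3
POINTS = list(product((0, 1), repeat=N))
INDEX = {p: i for i, p in enumerate(POINTS)}
ZERO, ONE = (0,) * len(POINTS), (1,) * len(POINTS)

def cofactor(f, a, val):
    return tuple(f[INDEX[p[:a] + (val,) + p[a + 1:]]] for p in POINTS)

def root(c):
    return next(a for a in range(N) if cofactor(c, a, 0) != cofactor(c, a, 1))

def ite(a, hi, lo):
    return tuple(h if p[a] else l for p, h, l in zip(POINTS, hi, lo))

def neg(f):
    return tuple(1 - x for x in f)

def constrain(f, c):
    """Constrain operator on truth tables (c must differ from 0)."""
    if c == ONE or f in (ZERO, ONE):
        return f
    a = root(c)
    c0, c1 = cofactor(c, a, 0), cofactor(c, a, 1)
    if c0 == ZERO:
        return constrain(cofactor(f, a, 1), c1)
    if c1 == ZERO:
        return constrain(cofactor(f, a, 0), c0)
    return ite(a, constrain(cofactor(f, a, 1), c1),
                  constrain(cofactor(f, a, 0), c0))

def img_codomain(F):
    """Img(F, 1) by splitting on the value of the last component."""
    if not F:
        return {()}                       # the empty vector has one image point
    *rest, fn = F
    result = set()
    for bit, c in ((0, neg(fn)), (1, fn)):
        if c != ZERO:                     # this output value is reachable
            sub = img_codomain([constrain(g, c) for g in rest])
            result |= {y + (bit,) for y in sub}
    return result

F = [tuple(p[0] ^ p[1] for p in POINTS),
     tuple(p[1] & p[2] for p in POINTS),
     tuple(p[0] | p[2] for p in POINTS)]
reached = img_codomain(F)
```

Each recursion level fixes one output component, so the recursion explores only output vectors that are actually produced.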

The number of recursions needed to compute $\mathrm{Img}(F, 1)$ is bounded by the number
of elements of this set. Several techniques have been proposed to reduce this
number of recursions. Vector partitioning [9] consists in splitting the vector F
into several sub-vectors of functions which have disjoint supports of variables.
In [6] it has been proposed to use a cache where partial results obtained during
previous recursions are stored. However, the exact matching used in [6] can
be replaced by an extended matching that allows us to match any vector of k
components in the cache with $(k! \times 2^k)$ vectors, with a complexity in $O(k \log k)$.
This extended matching test is based on the following properties.

Theorem 5 If $\chi$ is the characteristic function of $\mathrm{Img}([f_1 \cdots f_k], A)$, then the
characteristic function of $\mathrm{Img}([\varepsilon_1(f_1) \cdots \varepsilon_k(f_k)], A)$, where $\varepsilon_j$ is the identity or
the negation, is $\chi(\varepsilon_1(y_1), \ldots, \varepsilon_k(y_k))$.

Theorem 6 If $\chi$ is the characteristic function of $\mathrm{Img}([f_1 \cdots f_k], A)$, then the
characteristic function of $\mathrm{Img}([f_{\sigma(1)} \cdots f_{\sigma(k)}], A)$, where $\sigma$ is a permutation of
the k first integers, is $\chi(y_{\sigma(1)}, \ldots, y_{\sigma(k)})$.

4.3.2 Domain Partitioning Based Algorithm. The algorithm
based on domain partitioning is a direct application of Theorem 7. The techniques
described above can also be used to reduce the number of recursions
needed to compute the result. This number is bounded by $\prod_j |f_j|$, but we think
that it is directly related to |F|, where |F| is the number of vertices in the graphs
that represent F. Note that this algorithm does not create any vertex, except for
characteristic functions.



circuit   #reg   #in   depth    #valid    t_codp      t_dp
s838        32    35      17        17       2.4       2.5
mdc         11    11      14        35       2.3       2.6
scf          8    27      16       115         6       5.8
s298        14     3      19       218       2.9       2.8
s713        19    35       7      1544      81.7      88.2
s344        15     9       7      2625      32.7      28.6
s382        21     3     151      8865       128        79
s444        21     3     151      8865       126      75.4
cbp16       16    17       2     6.5E4       1.2         1
cbp32       32    33       2     4.3E9       5.7       4.5
key         56    62       2    7.2E16      15.5       4.6
stage       64   113       2    1.8E19   > 10000      1242
sbc         28    40      10     1.5E5   > 10000      3530
sync        21     4      20      1469       140       154
clml        33    13     396     3.8E5      1227      2620
dm2         32   382     279     3.3E6      8520   > 10000
mm10        30    13       4     1.8E8       834        32
mm20        60    23       4    1.9E17   > 10000       212
mm30        90    33       4      2E26   > 10000       760

Table 1. Valid States Computation.



Theorem 7 Let $F = [f_1 \cdots f_n]$ be a Boolean vectorial function. Then:
$\mathrm{Img}([f_1 \cdots f_n], 1) = \mathrm{Img}([f_1/\neg a \cdots f_n/\neg a], 1) \ \cup\ \mathrm{Img}([f_1/a \cdots f_n/a], 1)$.
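
Theorem 7 leads to a simple recursion on the input variables. The sketch below is a hedged illustration: functions are truth tables over N = 3 variables rather than decision graphs, the cache is a plain dictionary with exact matching only, and the names are hypothetical.

```python
from itertools import product

N = 3
POINTS = list(product((0, 1), repeat=N))
INDEX = {p: i for i, p in enumerate(POINTS)}

def cofactor(f, a, val):
    """f with variable a fixed to val."""
    return tuple(f[INDEX[p[:a] + (val,) + p[a + 1:]]] for p in POINTS)

def img_domain(F, a=0, cache=None):
    """Img(F, 1) by splitting the domain on variable a, then a + 1, ..."""
    cache = {} if cache is None else cache
    key = tuple(F)
    if key not in cache:
        if all(len(set(f)) == 1 for f in F):
            # Every component is constant: the image is a single vector.
            cache[key] = {tuple(f[0] for f in F)}
        else:
            lo = img_domain([cofactor(f, a, 0) for f in F], a + 1, cache)
            hi = img_domain([cofactor(f, a, 1) for f in F], a + 1, cache)
            cache[key] = lo | hi
    return cache[key]

F = [tuple(p[0] ^ p[1] for p in POINTS),
     tuple(p[1] & p[2] for p in POINTS)]
reached = img_domain(F)
```

Only cofactors of the original functions are formed here; no new function is built apart from the resulting set.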

5. Experimental Results - Discussion 

The algorithms are written in LISP, and the CPU times in seconds are for
a BULL DPX5000 minicomputer. Table 1 gives the CPU times needed to
compute the set of valid states of some digital circuits. For all circuits, #in is
the number of inputs, #reg the number of state variables, depth the number of
iterations, #valid the number of valid states, and $t_{codp}$ and $t_{dp}$ the CPU times
for the algorithm based on co-domain partitioning and for the algorithm based
on domain partitioning respectively.

There are circuits that can be treated by only one of the two algorithms. The 
circuit dm2 can be treated only with co-domain partitioning: at each step during 
the computation, only a few states are reached. The MinMax [9] circuits mm20 
and mm30 can be treated only with domain partitioning: the hit ratio in the cache 
is very high for this algorithm, which is not the case for the other algorithm, and 
the number of states that are reached at each step is very large. 
The symbolic verification algorithm of temporal properties has been applied
to the machines clml and sync. The property to be proved valid on clml required
the computation of only one fixed point, which took 38 steps and 4000 s of
CPU time to obtain. The restrict operator was very useful since it reduced
the graphs used to compute the term $\chi_B(\delta(s, p))$ in such a way that during the
iteration, none of them had more than 186 vertices, while some of the graphs
that represent the transition function of clml have more than 2500 vertices. The
property to be verified on sync was $Init \models AG(OK = 1)$. The CPU time needed
to make this verification was 4160 seconds and the fixed point was found in 9
steps. The restrict operator was not very useful.

Note that the property $AG(OK = 1)$ can be verified using the algorithm presented
in Section 2.3 [10], and the verification time is then the time needed
to compute the valid states of sync, which is much smaller. This is quite
understandable since the “Pre” operation is intrinsically more complex than the
“Img” operation [10].

6. Conclusion 

In this paper we have shown that the two kinds of verification that are needed 
to design correct sequential circuits can be treated in a unified framework. We 
have presented verification algorithms that manipulate sets of inputs represented 
by the typed decision graphs of their characteristic functions. 

Though these algorithms are very efficient, they do not allow us to deal directly
with the complex circuits designed at BULL. This means that some techniques
are still needed to exploit the full power of this kernel. These techniques
control and data, and the decomposition of the verification task according to the 
circuit structure. 
Abstract

The Ordered Binary Decision Diagram (OBDD) has proven useful in many applications as an
efficient data structure for representing and manipulating Boolean functions. A serious drawback
of OBDD’s is the need for application-specific heuristic algorithms to order the variables before
processing. Further, for many problem instances in logic synthesis, the heuristic ordering
algorithms which have been proposed are insufficient to allow OBDD operations to complete within a
limited amount of memory. In this paper, I propose a solution to these problems based on having
the OBDD package itself determine and maintain the variable order. This is done by periodically
applying a minimization algorithm to reorder the variables of the OBDD to reduce its size. A
new OBDD minimization algorithm, called the sifting algorithm, is proposed and appears
especially effective in reducing the size of the OBDD. Experiments with dynamic variable ordering on
the problem of forming the OBDD’s for the primary outputs of a combinational circuit show that
many computations complete using dynamic variable ordering when the same computation fails
otherwise.



1. Introduction 

Boolean function manipulation is an important component of many logic syn- 
thesis algorithms including logic optimization and logic verification of combi- 
national and sequential circuits. The Ordered Binary Decision Diagram (OBDD) 
has proven useful in these applications as an efficient data structure for the repre- 
sentation and manipulation of Boolean functions. However, a serious drawback 
of OBDD’s is the need to order the variables.

When an order is found which keeps the OBDD size manageable (e.g., less
than 100,000 nodes), OBDD-based techniques perform very well. However, it
is usually necessary to devise heuristics to order the variables for each OBDD
application. Besides the burden this places on the programmer trying to apply
OBDD’s in a particular setting, there is the problem that in many instances the
heuristic algorithms which have been proposed are unable to find a variable
order which keeps the OBDD’s small. This implies that many OBDD applications
give up (space-out) before an operation can be completed.

Many OBDD applications in logic synthesis begin by forming the OBDD for
each primary output in a combinational circuit (in terms of the primary inputs
to the circuit). For many problem instances, choosing a random order for the
variables leads to OBDD’s which are too large. For example, it is not possible
to form the OBDD’s for 23 of 35 large circuits from the IWLS’91 benchmark set
when using a random variable order. Heuristic ordering algorithms, such as the
depth-first heuristic algorithm and its variations ([10, 5, 11]), are a significant
improvement over random variable ordering. However, it is still not possible
to form the OBDD’s for 11 of 35 large circuits from the IWLS’91 benchmark
set when using this heuristic order. It has remained an open problem whether
this is a limitation of the variable ordering algorithms or simply the inherent
exponential worst case complexity of the OBDD representation.

This paper describes a general paradigm to improve the robustness of any 
OBDD package by using automatic variable re-ordering. First I review the nec- 
essary details of an OBDD package and then describe the dynamic variable or- 
dering strategy. Two OBDD minimization algorithms are presented, including 
a new algorithm called the sifting algorithm. Experimental results are given 
for these algorithms which demonstrate the utility of dynamic variable ordering
when applied to the problem of forming OBDD’s for the primary outputs in
a combinational logic network. The last section provides directions for future 
improvements of this technique. 

2. OBDD Implementation Review 

I assume that the reader is familiar with Ordered Binary Decision Diagrams as 
introduced by Bryant [3]. In the paper by Brace, Rudell and Bryant [2], details 
for an efficient implementation of an OBDD package were outlined. I review here
some details necessary for the remainder of the paper. 

A multi-rooted (shared) directed acyclic graph (DAG) is used to represent a
set of Boolean functions. Each node in the DAG represents a Boolean function F
and has an associated variable $x_i$ and pointers to two other nodes (functions) in
the DAG. The node F is written as the tuple $(x_i, G, H)$ where $x_i$ is called the top
variable of the function F, G is the positive cofactor of F with respect to $x_i$
($G = F_{x_i}$), and H is the negative cofactor of F with respect to $x_i$ ($H = F_{\bar{x}_i}$). The node F
thus represents the function $F = x_i G + \bar{x}_i H$. G is also known as the THEN node
and H is also known as the ELSE node. The sink nodes represent the constant
functions 0 and 1.

Ordered BDD’s have a total order imposed on the variables; i.e., an order 
is assigned to each variable, and the variables must appear in ascending order 
along every path in the OBDD. Because the BDD is ordered, the DAG can be 
levelized with all nodes with a particular top variable at a given level. Level i
refers to all nodes with a top variable $x_i$.

A global hash table, called the unique table, allows a node of the DAG,
$(x_i, G, H)$, to be found in constant time. A hash function is computed on the
tuple $(x_i, G, H)$ which provides an index into an array of bins which store the
first DAG node for that hash value. All of the nodes with the same hash value
(i.e., collisions) are stored in a linked list. Each node of the DAG occupies 4
words: the variable index plus other flags, a pointer to the THEN node, a pointer
to the ELSE node, and a pointer to the next node on the collision chain for the
unique table.
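
The unique table described above can be sketched as follows. This is a minimal model under stated assumptions, not the package’s actual code: Python objects stand in for 4-word records, `id()` stands in for a pointer value, the class and method names are hypothetical, and the collapse of a node whose two children are identical is the usual OBDD reduction rule rather than something stated in this paragraph.

```python
class Node:
    """One DAG node: variable index, THEN child, ELSE child, chain link."""
    __slots__ = ("var", "then", "else_", "next")

    def __init__(self, var, then, else_, next_=None):
        self.var, self.then, self.else_, self.next = var, then, else_, next_

ZERO = Node(None, None, None)   # sink node for constant 0
ONE = Node(None, None, None)    # sink node for constant 1

class UniqueTable:
    def __init__(self, nbins=8):
        self.bins = [None] * nbins   # each bin holds the head of a collision chain
        self.count = 0

    def _hash(self, var, then, else_):
        return hash((var, id(then), id(else_))) % len(self.bins)

    def find_or_add(self, var, then, else_):
        """Return the unique node (var, then, else_), creating it if needed."""
        if then is else_:
            return then              # degenerate node: no test on var needed
        h = self._hash(var, then, else_)
        node = self.bins[h]
        while node is not None:      # walk the collision chain
            if node.var == var and node.then is then and node.else_ is else_:
                return node
            node = node.next
        node = Node(var, then, else_, self.bins[h])
        self.bins[h] = node          # new node becomes the head of the chain
        self.count += 1
        return node

table = UniqueTable(nbins=2)
x3 = table.find_or_add(3, ONE, ZERO)     # the function x3
same = table.find_or_add(3, ONE, ZERO)   # found, not duplicated
```

Because equal tuples always resolve to the same object, two functions are equal exactly when their root pointers are identical.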

A global cache, called the computed table, is used as a memory function for 
the recursive algorithms which operate on the DAG. This table, implemented 
as a hash-based cache, stores the results of recursive operations such as ITE, 
but overwrites an entry when a collision occurs, rather than using a link-chain 
to resolve collisions. Each hash-based cache entry occupies 4 words: 3 words 
which form the key for the operation (e.g., ITE(F,G,H) uses F, G, and H as the 
key) and a single word which is the operation result. A ratio of one cache entry 
is maintained for every four unique table entries so that the total memory usage 
of the package, including all overhead, is approximately 24 bytes per DAG node 
on a 32-bit machine. 
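
The overwrite-on-collision behaviour of the computed table can be illustrated with a few lines. The sketch assumes a direct-mapped array of slots, with plain integers standing in for node pointers; the class name and methods are hypothetical.

```python
class ComputedTable:
    """Hash-based cache: one entry per slot, overwritten on collision."""

    def __init__(self, size=1024):
        self.slots = [None] * size

    def _index(self, key):
        return hash(key) % len(self.slots)

    def lookup(self, f, g, h):
        """Return the cached result for the key (f, g, h), or None."""
        entry = self.slots[self._index((f, g, h))]
        if entry is not None and entry[0] == (f, g, h):
            return entry[1]
        return None

    def insert(self, f, g, h, result):
        # No chaining: whatever occupied this slot before is simply lost.
        self.slots[self._index((f, g, h))] = ((f, g, h), result)

cache = ComputedTable(size=1)    # a single slot forces every key to collide
cache.insert(1, 2, 3, "r1")
first = cache.lookup(1, 2, 3)    # hit
cache.insert(4, 5, 6, "r2")      # overwrites the previous entry
second = cache.lookup(1, 2, 3)   # miss: the entry was overwritten
```

As a rough check of the memory figure quoted above: a 4-word node is 16 bytes on a 32-bit machine, and one 16-byte cache entry per four nodes adds 4 bytes per node, giving 20 bytes. The remainder up to the stated total of approximately 24 bytes would be bin arrays and other overhead; the text does not give the exact breakdown.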

The OBDD package uses garbage collection to recycle memory. A reference 
count is maintained for each node in the DAG, and a count of the number of 
dead (i.e., unreferenced) nodes in the DAG is maintained as nodes are created 
and freed. Dead nodes cannot be freed immediately because an entry from the 
computed table may point to the node; these references are not included in the 
reference count because the computed table entries are never deleted. During a 
recursive operation such as ITE, a find-or-add operation is performed to either
find a node in the DAG or to create a new node if the given node does not exist.
If a new node is created causing the unique table to become too full, then either
a garbage collection is performed if there are enough dead nodes in the DAG to
make it worthwhile, or the unique table array is increased in size by a factor of 
two. 

The key drawback to this package, which this paper addresses, is that the 
ordering of the variables is specified in advance by the user. No assistance is of- 
fered to help order the variables, and the order cannot be subsequently changed. 

3. Dynamic Variable Ordering 

In this paper, I propose a general paradigm for maintaining variable orders in 
an OBDD. The idea is to have the OBDD package determine and maintain the 
variable order of the OBDD. This variable order is changed automatically by the 
OBDD package, transparently to the user, as operations are performed. Because 
the variable order within the OBDD is no longer static, this technique is referred 
to as dynamic variable ordering. 

Dynamic variable ordering differs from the typical use of OBDD’s where the
variables are ordered once when the OBDD is created and the order is maintained
throughout all subsequent processing. It also differs slightly from other proposed
variable ordering schemes in that the re-ordering is not performed at the
explicit request of the user. Instead, the package determines appropriate points
at which to stop processing, choose a new order, and then resume processing.

When using dynamic variable ordering, a total order is defined for all variables
before and after each package operation; however, the order is periodically
adjusted by the OBDD package, as a consequence of an operation, to find a better
order. Logically, the variable order changes between package operations.
Thus, we maintain all advantages provided by the ordered BDD data structure,
such as canonicity and efficient recursive algorithms.

A well-designed interface to an OBDD package hides all details of the OBDD
data structure. The programmer simply creates the variables and then uses package
operations such as AND, OR, and NOT to form new functions. This allows
dynamic variable ordering to be applied transparently to the user of the OBDD
package.

There are two goals we hope to achieve with dynamic variable ordering. The 
first goal is to allow OBDD operation sequences which fail when using a fixed
heuristic variable order to succeed when a new order is chosen mid-stream. The 
second goal is to reduce the need for the heuristic ordering algorithms - i.e., 
problems which complete with a heuristic variable order should also complete 
when starting from a random variable order. If we can deliver on these goals, 
the advantage of dynamic variable ordering to the user of the OBDD package is 
clear. 

One implementation of dynamic variable ordering is as follows. At each
garbage collection within the OBDD package, which is triggered based on the
growth of the number of nodes in the DAG, a variable-reordering algorithm is
applied to the OBDD to reduce its size. Any variable ordering algorithm
can be applied at this step, but the algorithms which are used must be
efficient because they will be applied repeatedly as OBDD processing proceeds. 
Of course, the algorithms must also be effective in finding a better variable order 
for the OBDD. 

Dynamic variable ordering motivates the exploration of algorithms for OBDD
minimization; i.e., reducing the size of all functions simultaneously represented
by a multi-rooted OBDD by changing the variable order. Two algorithms for
OBDD minimization are considered in the next section. 

4. Variable Reordering Algorithms 

Many people, including Brace [1], Fujita et al. [6], and Ishiura et al. [7] have 
made the observation that swapping the order of two adjacent variables in an 
OBDD affects only the DAG nodes at the two levels; all other nodes remain un- 
changed. In this section, I describe how to implement this operation so that 
its complexity is proportional to the number of nodes at the particular level of 
the DAG and independent of the size of the entire DAG. This efficient adjacent 
variable swap forms the core for many OBDD minimization algorithms. I then
describe the window permutation algorithm and the sifting algorithm for
minimizing the size of the OBDD.

4.1 Efficient Variable Swap 

There are two problems with making the variable swap of $x_i$ and $x_{i+1}$ have
local complexity. The first is that we need to find all nodes at level i without
walking the entire DAG starting from the roots. The second is that each node in 
the DAG must represent the same function before and after the variable swap to 
avoid patching any references to that node. 

A memory-efficient scheme to find all nodes at level i replaces the single 
unique table with an array of hash tables, one per level of the DAG. The variable
index is used to locate the hash table which stores all nodes for that level of the 
DAG. The hash table has an array of bins which store the first node for each hash 
value for this level. Hence, all nodes at a given level can be visited by walking 
the collision chain which starts at each hash table array position. 

A node F at level i can be pointed to by other nodes above it in the DAG, and 
by functions which have already been returned to the user. To reduce memory, 
back-pointers are not maintained. Hence, there is no way to reach all references 
to node F without walking the entire DAG. Therefore, to perform a local variable
swap, it is necessary to maintain an identical logical function at each node. This
is done by overwriting the node representing F with the new node which results
from the modification to the variable order, as follows.

Let $F = (x_i, F_1, F_0)$ be a node at level i. Let $F_{11}$ be the positive cofactor of $F_1$ with
respect to $x_{i+1}$. Computing this cofactor is trivial: the result is either the THEN
node pointed to by $F_1$ (if $x_{i+1}$ is the top variable of $F_1$) or $F_1$ (otherwise). Similarly,
let $F_{10}$ be the negative cofactor of $F_1$, and let $F_{01}$, $F_{00}$ be the two cofactors
of $F_0$. Node F is overwritten with the tuple $(x_{i+1}, (x_i, F_{11}, F_{01}), (x_i, F_{10}, F_{00}))$.
Expansion of this formula shows that it preserves the function of node F, and
inspection ensures that the new variable order (i.e., $x_{i+1}$ is above $x_i$) is established
for all paths through F.
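
The rewriting rule above can be checked on a small example. The sketch below is an illustration under stated assumptions: nodes are immutable Python tuples `(var, then, else)` with integer sinks 0 and 1, so the swap returns a new tuple instead of overwriting the node in place, and the helper names are hypothetical.

```python
from itertools import product

def mk(var, then, else_):
    """Node (var, then, else); a degenerate node collapses to its child."""
    return then if then == else_ else (var, then, else_)

def cofactors(g, var):
    """(positive, negative) cofactor of g with respect to var.

    Assumes var is at or above the top level of g, so the lookup is trivial.
    """
    if isinstance(g, tuple) and g[0] == var:
        return g[1], g[2]
    return g, g

def swap(F, xi, xj):
    """Re-express F = (xi, F1, F0) with xj, the next level down, on top."""
    _, F1, F0 = F
    F11, F10 = cofactors(F1, xj)
    F01, F00 = cofactors(F0, xj)
    return mk(xj, mk(xi, F11, F01), mk(xi, F10, F00))

def evaluate(node, env):
    """Follow THEN/ELSE pointers according to the assignment env."""
    while isinstance(node, tuple):
        var, then, else_ = node
        node = then if env[var] else else_
    return node

# F = if a then (b or c) else c, with order a < b < c.
C = ("c", 1, 0)
F = ("a", ("b", 1, C), C)
G = swap(F, "a", "b")            # same function with b above a
```

In this example one of the two new nodes at the level of `a` is degenerate (both cofactors equal `c`), which is the situation the following paragraph of the text discusses.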

The new nodes required at level i (i.e., $(x_i, F_{11}, F_{01})$ and $(x_i, F_{10}, F_{00})$) may
be degenerate nodes (e.g., in the case that $F_{11} = F_{01}$), or may already exist in
the DAG as required to implement other functions. When F is re-expressed as
a result of the variable swap, the DAG’s rooted at $F_1$ and $F_0$ can be freed. Note,
however, that the nodes $F_{00}$, $F_{01}$, $F_{10}$, $F_{11}$ all have references after the variable
swap, so that only the root nodes $F_1$ and $F_0$ can be freed as a result of the swap.
To be specific, node $F_1$ can be freed if the only reference to $F_1$ previously came
from node F.

We can make use of this observation to perform incremental garbage collection
during the variable swap. Before OBDD minimization is applied, a garbage
collection is performed and the computed table is cleared. Thereafter, nodes at
level i + 1 can be deleted incrementally if they have no other reference besides
the reference from level i.

Attributed edges have been proposed by, among others, Karplus [8], Madre 
and Billon [9], and Minato et al. [11]. The edges in the OBDD are tagged to 
indicate a modification of the referenced function. This reduces the size of the 
DAG by allowing a single node to represent several different functions. The most 
popular and useful attribute is the negate-output edge, although other attributes, 
such as negate-input edge [11] and negate-else edge [4] have been proposed. 
The inclusion of attributed edges is usually transparent to the algorithms which 
operate on the OBDD. In particular, the complexity of an adjacent variable swap 
is unaffected by the inclusion of negate-output edges. However, inclusion of 
either negate-input or negate-else edges appears to destroy the local complexity 
of a variable swap by requiring pointers to a modified node to be changed. (More 
details can be found in [12].) One impact of dynamic variable ordering, then, is 
that these last two edge attributes cannot be used. 

4.2 Window Permutation Algorithm 

Fujita et al. [6] and Ishiura et al. [7] presented similar heuristic algorithms for 
minimizing the size of an OBDD using adjacent variable exchange. I refer to this
algorithm as the window permutation algorithm.

The window permutation algorithm proceeds by choosing a level i in the DAG
and exhaustively searching all permutations of the k adjacent variables starting
at level i. This is done using $k! - 1$ pairwise exchanges followed by up to
$k(k - 1)/2$ pairwise exchanges to restore the best permutation seen. This is then
repeated starting from each level until no improvement in the DAG size is seen.
Figure 1 shows the variable permutations which are explored when applying a
window of size k = 3 starting at variable $x_2$. Six permutations are explored with
5 adjacent variable swaps, and then 3 additional variable swaps (worst case) are
used to restore the best permutation.



x1, x2, x3, x4, x5, x6, x7    initial
x1, x3, x2, x4, x5, x6, x7    swap (x2, x3)
x1, x3, x4, x2, x5, x6, x7    swap (x2, x4)
x1, x4, x3, x2, x5, x6, x7    swap (x3, x4)
x1, x4, x2, x3, x5, x6, x7    swap (x3, x2)
x1, x2, x4, x3, x5, x6, x7    swap (x4, x2)

x1, x2, x3, x4, x5, x6, x7    swap (x4, x3)
x1, x3, x2, x4, x5, x6, x7    swap (x2, x3)
x1, x3, x4, x2, x5, x6, x7    swap (x2, x4)

Figure 1. Window Permutation Example.
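
The exploration of a window by adjacent swaps can be sketched as follows. This is an assumption-laden illustration: the swap sequence is generated with the classical "plain changes" scheme, which need not be the exact sequence used in the figure, the OBDD size is supplied by the caller as an arbitrary function `size`, and the best order is restored by copying rather than by the up to k(k - 1)/2 restoring swaps.

```python
from math import factorial

def plain_changes(k):
    """Yield positions p such that swapping (p, p + 1) visits all k! orders."""
    if k <= 1:
        return
    down = True
    for s in plain_changes(k - 1):
        # Sweep the last element of the window across all other elements.
        yield from (range(k - 2, -1, -1) if down else range(k - 1))
        # One swap among the remaining k - 1 elements (shifted by one
        # position when the swept element sits at the left end).
        yield s + 1 if down else s
        down = not down
    yield from (range(k - 2, -1, -1) if down else range(k - 1))

def window_search(order, start, k, size):
    """Try every permutation of order[start:start + k]; keep the best."""
    order = list(order)
    best, best_size = list(order), size(order)
    swaps = 0
    for p in plain_changes(k):
        i = start + p
        order[i], order[i + 1] = order[i + 1], order[i]
        swaps += 1
        s = size(order)
        if s < best_size:
            best, best_size = list(order), s
    return best, best_size, swaps

# Toy size function (hypothetical): prefers variable 2 as late as possible.
best, best_size, swaps = window_search([1, 2, 3, 4, 5, 6, 7], 1, 3,
                                       lambda o: -o.index(2))
```

For a window of size 3 the generator produces 5 swaps, in agreement with the count of k! - 1 given in the text.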
Marking can be used to record when a variable exchange at a given level
may be profitable. A level is marked after the permutation at the level is known
to be optimal. This mark is reset when a new permutation is determined for any of
the preceding k - 1 levels. When all levels in the DAG are marked, the window
permutation algorithm cannot further improve the DAG size.

Because the swap of two adjacent variables is efficient, the window permu- 
tation algorithm remains practical for values of k as large as 4 or 5. However, 
results presented in Section 5.2 and Section 5.3 indicate that the window permu- 
tation algorithm is limited in its ability to find good variable orders. 

4.3 Sifting Algorithm 

I propose a new OBDD minimization algorithm in this paper which I call the
sifting algorithm. This algorithm is based on finding the optimum position for 
a variable, assuming all other variables remain fixed. If there are n variables in 
the DAG (excluding the constant level which is always at the bottom), then there 
are n potential positions for a variable, including its current position. Among 
these n positions, the subgoal employed by the sifting algorithm is to find the 
spot which minimizes the size of the DAG. 

Ideally, we could find the best position for a variable assuming all other vari- 
ables remain fixed with a low-complexity analysis of the OBDD. However, this 
does not appear possible. Therefore, the optimum position for a variable is de- 
termined by brute-force enumeration as follows. The variable is exchanged with 
its successor variable until the variable becomes the next to last variable in the 
DAG; i.e., the variable is sifted down to the bottom of the DAG. Then the vari- 
able is exchanged with its predecessor variable until the variable becomes the 
top variable in the DAG; i.e., the variable is sifted up to the top of the DAG.
The best DAG size seen during this search is remembered and the position of
the variable is restored by moving the variable from the top position down to its
optimum position. Figure 2 shows the variable permutations which are explored
when applying the sifting algorithm to variable $x_4$. The 7 positions for variable
$x_4$ are explored using 9 adjacent swaps, and the optimum position is restored
with an additional 6 swaps (worst case).

The sifting algorithm proceeds as follows. The variables are sorted into de- 
creasing size based on the number of nodes at each level of the DAG. Then each 
variable is moved to its locally optimum position assuming that all other vari- 
ables remain fixed. Each variable is moved only once in this process, although 
the algorithm could be iterated to convergence. 
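
The effect of sifting can be illustrated on a small function. The sketch below is not the package implementation: it assumes the function is given as a Python predicate, computes OBDD level sizes by enumerating cofactors (exponential, for illustration only), and tries each position by rebuilding the order instead of performing adjacent swaps. All names are hypothetical.

```python
from itertools import product

def level_sizes(f, order):
    """Number of OBDD nodes at each level for function f under this order."""
    n = len(order)
    sizes = []
    for k in range(n):
        nodes = set()
        for prefix in product((0, 1), repeat=k):
            rows = []
            for suffix in product((0, 1), repeat=n - k):
                x = [0] * n
                for var, bit in zip(order, prefix + suffix):
                    x[var] = bit
                rows.append(f(x))
            half = len(rows) // 2
            if rows[:half] != rows[half:]:   # cofactor depends on order[k]
                nodes.add(tuple(rows))       # distinct cofactor = distinct node
        sizes.append(len(nodes))
    return sizes

def obdd_size(f, order):
    return sum(level_sizes(f, order))

def sift_variable(f, order, v):
    """Best position for v, all other variables keeping their relative order."""
    rest = [u for u in order if u != v]
    candidates = [rest[:i] + [v] + rest[i:] for i in range(len(order))]
    return min(candidates, key=lambda o: obdd_size(f, o))

def sift(f, order):
    """One sifting pass: variables with the largest levels are moved first."""
    order = list(order)
    sizes = level_sizes(f, order)
    by_size = sorted(order, key=lambda v: -sizes[order.index(v)])
    for v in by_size:
        order = sift_variable(f, order, v)
    return order

def f(x):
    return (x[0] and x[1]) or (x[2] and x[3]) or (x[4] and x[5])

good = [0, 1, 2, 3, 4, 5]
bad = [0, 2, 4, 1, 3, 5]
result = sift(f, bad)
```

For this function the interleaved order `bad` needs 14 internal nodes against 6 for `good`. Since the current position of a variable is always among the candidates, a sifting pass can never increase the size.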

The sift algorithm has the advantage that a variable can move a long distance
in the ordering. Note that the DAG size can increase significantly after the first
few variable swaps, and then eventually reduce below the starting point. This
allows a type of up-hill move to be taken: the acceptance of the entire sequence
of pairwise swaps is based on the best position seen regardless of any increase
in the intermediate DAG size. A limitation of the window permutation algorithm
appears to be that several moves can be required to move a variable a long
distance and these moves can be blocked by an intermediate up-hill move.

x1, x2, x3, x4, x5, x6, x7    initial
x1, x2, x3, x5, x4, x6, x7    swap (x4, x5)
x1, x2, x3, x5, x6, x4, x7    swap (x4, x6)
x1, x2, x3, x5, x6, x7, x4    swap (x4, x7)
x1, x2, x3, x5, x6, x4, x7    swap (x7, x4)
x1, x2, x3, x5, x4, x6, x7    swap (x6, x4)
x1, x2, x3, x4, x5, x6, x7    swap (x5, x4)
x1, x2, x4, x3, x5, x6, x7    swap (x3, x4)
x1, x4, x2, x3, x5, x6, x7    swap (x2, x4)
x4, x1, x2, x3, x5, x6, x7    swap (x1, x4)

x1, x4, x2, x3, x5, x6, x7    swap (x4, x1)
x1, x2, x4, x3, x5, x6, x7    swap (x4, x2)
x1, x2, x3, x4, x5, x6, x7    swap (x4, x3)
x1, x2, x3, x5, x4, x6, x7    swap (x4, x5)
x1, x2, x3, x5, x6, x4, x7    swap (x4, x6)
x1, x2, x3, x5, x6, x7, x4    swap (x4, x7)

Figure 2. Sifting Algorithm Example.

The sift algorithm requires $O(n^2)$ swaps of adjacent levels in the DAG, and
each of these variable swaps has complexity proportional to the width of the 
DAG. To control the worst-case complexity, the search in a particular direction 
is terminated if the DAG size grows to twice its original size. 

5. Experimental Results 

The IWLS’91 benchmark set includes a directory of 76 combinational circuits
(cmlexamples) and 40 sequential circuits (smlexamples). This includes
the ISCAS’85 and ISCAS’89 benchmarks. Because most of these circuits are
trivially small, I focus here on the 35 largest multiple-level examples. All DAG
sizes are given in thousands of nodes, and the run-times are measured on a Sun
Microsystems SparcStation-10 Model 41.

Due to space limitations, only summary results are presented in this paper. 
Complete tables of results appear in [12]. 

5.1 Random Orders vs. Heuristic Orders 

The first experiment was to form the OBDD’s for all primary outputs. The
same variable order was used for all of the primary outputs. The variable order 
was first determined with a single random trial, and then using the depth-first 
heuristic ordering algorithm. 
When the maximum DAG size was set to 100,000 nodes (2.4 MB of memory),
it was not possible to form the OBDD’s for 23 of the 35 circuits when using
a random variable order, and it was not possible to form the OBDD’s for 11
of the 35 circuits when using the depth-first heuristic ordering algorithm. The
11 circuits which failed when using a depth-first ordering were c2670, c3540,
c6288, c7552, i10, mm9a, mm9b, mm30a, s9234.1, s15850.1, and s38417.

When the maximum DAG size was increased to 1,000,000 nodes (24 MB of
memory), the random order failed for 13 of the 35 circuits and the heuristic
order failed for 7 of the 35 circuits. The 4 additional circuits which completed
when given more memory were c3540, i10, s9234.1, and s15850.1.

5.2 OBDD Minimization Comparison 

The next experiment compares the window permutation algorithm against the 
sift algorithm for OBDD minimization. The OBDDs for 24 of the 35 examples 
can be formed using the heuristic ordering algorithm and a 100,000 node limit. 
After the OBDDs had been formed, these 24 circuits were minimized with the 
window permutation algorithm for k = 2, 3, 4, 5 and with the sifting algorithm. The 
relative DAG-size and CPU ratios for each minimization algorithm are given in 
Figure 3. 



Algorithm          size    cpu 
No minimization    1.00    1.00 
Window, k=2        0.81    1.19 
Window, k=3        0.72    1.49 
Window, k=4        0.70    2.83 
Window, k=5        0.67    9.19 
Sift               0.55    3.84 

Figure 3. Minimization Comparison. 



The sift algorithm results in OBDDs which are 45% smaller than with the heuristic 
order (see Note 1), while the window permutation algorithm with k = 4 produces OBDDs 
which are only 30% smaller. The sift algorithm produces OBDDs which are 
20% smaller than the window permutation algorithm (k = 4) at the cost of an 
additional 40% in run-time. 
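
Note 1 defines each reported average as the arithmetic mean of the per-example ratios. The sketch below shows that computation and how it can differ from a ratio of totals. The sizes in it are made-up numbers chosen for illustration, not benchmark data, and the function names are assumptions.

```python
def mean_of_ratios(before, after):
    """Average as defined in Note 1: mean of the per-example ratios."""
    ratios = [a / b for b, a in zip(before, after)]
    return sum(ratios) / len(ratios)

def ratio_of_totals(before, after):
    """Alternative summary: total size after divided by total size before."""
    return sum(after) / sum(before)

# Hypothetical DAG sizes for two examples (thousands of nodes).
before = [100.0, 10.0]
after = [50.0, 10.0]

print(mean_of_ratios(before, after))   # each example counts equally
print(ratio_of_totals(before, after))  # dominated by the larger example
```

With the mean of ratios, a small circuit influences the average as much as a large one, which is why the summary figures need not match the ratio of the column totals.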

5.3 Dynamic Variable Ordering 

The next experiment compares the window permutation algorithm and the 
sifting algorithm in the context of dynamic variable ordering. For this experiment, 
I focused on the 11 examples which cannot complete with the heuristic 
order. Dynamic variable ordering was performed by applying the corresponding 
OBDD minimization algorithms at each garbage collection, and the measurement 
criterion was whether the examples could complete. The results are given in 
Figure 4, where FAIL refers to the number of primary outputs which failed to have 
their OBDD formed. 

Figure 4. Results for 11 Hard Examples. 



Dynamic variable ordering using the window permutation algorithm (k=4) 
was able to complete 3 of the 11 failed examples (mm9a, mm9b, and s15850.1). 
Dynamic variable ordering using the sifting algorithm was able to complete 9 of 
the 11 failed examples. Only c6288 and s38417 still fail with the 100,000 node 
limit. Even for these two failed examples, fewer outputs failed when the sifting 
algorithm was used. 

5.4 Performance Impact 

The last experiment compares the run-time for the 35 largest circuits without 
dynamic variable ordering, with dynamic variable ordering starting from the heuristic 
variable order, and with dynamic variable ordering starting from a randomly 
generated variable order. The results are summarized in Figure 5. The sift algo- 
rithm was used when performing dynamic variable ordering. 



Algorithm          size    cpu 
No minimization    1.00    1.00 
DVO/Heuristic      0.56    6.80 
DVO/Random         0.58    11.80 

Figure 5. Performance Results. 



For the 24 examples which complete both with and without dynamic variable 
ordering, the run-time increased by an average factor of 6.8 when using dynamic 
variable ordering starting from the heuristic order. The average OBDD size for 
these 24 examples was reduced by a factor of 1.8 (45%) when dynamic variable 
ordering was used. 

Figure 5 also compares dynamic variable ordering starting from the heuristic 
order (DVO/heuristic) to dynamic variable ordering starting from a randomly 
generated variable order (DVO/random). The sifting algorithm was still able 
to complete for all but 3 examples (c6288, mm30a, and s38417). Interestingly, 
the DAG sizes starting from the random order were only slightly larger than the 
DAG sizes when starting from the heuristic order while the run-time increased 
by almost a factor of 2. 

5.5 Summary 

The sift algorithm is superior to the window permutation algorithm, both as 
a static OBDD minimization algorithm and in the application of dynamic variable 
ordering. The sift algorithm consistently produces smaller OBDDs than the 
window permutation algorithm, although it has run-times which are longer. 

The application of the sift algorithm in conjunction with dynamic variable 
ordering allows 9 of the 11 combinational circuits which could not complete 
using a static variable order to complete. 

When a random variable order was used as the starting point rather than the 
heuristic variable order, OBDD processing was still able to complete for 32 of the 
35 largest examples, including 8 of the 11 difficult examples. While some improvement 
still exists when starting from the heuristic order, for most examples 
a random order does not affect the ability of the OBDD processing to complete. 

These results indicate that the goals of completing more computations and 
reducing the dependence on the need for heuristic ordering algorithms have been 
achieved for this application. 

6. Conclusions 

This paper proposes a modification to an OBDD package whereby the OBDD 
package owns and maintains the order of the variables. At each garbage collection, 
an algorithm is applied to the OBDD to reorder the variables so as to 
reduce the number of nodes in the OBDD. Two OBDD minimization algorithms 
were tried: the window permutation algorithm and the sifting algorithm. Little 
benefit was seen from the application of the window permutation algorithm for 
k = 4. Dynamic variable ordering using the sifting algorithm was able to complete 
several OBDD operations which were not able to complete without dynamic 
variable ordering, and the resulting OBDDs tended to be significantly smaller. In 
almost all cases, dynamic variable ordering was able to complete the OBDD operation 
sequences even when starting from a random variable order rather than 
a heuristically determined variable order. The drawback of dynamic variable 
ordering is that the run-time for the OBDD operations increases significantly. 

7. Future Directions 

One direction for this work is to investigate dynamic variable ordering when 
applied to other OBDD problems. For example, the fixed-point algorithm found 
in sequential verification has the problem that determining a good variable order 
a priori for both the transition relation OBDD and the state space OBDD is diffi- 
cult. It would be interesting to see if dynamic variable ordering could improve 
the efficiency and application of these algorithms. 

The utility of dynamically determining the variable order for the OBDD has 
been demonstrated; however, the run-time impact is very large. One idea to 
improve the sifting algorithm would be to devise a more efficient algorithm to 
determine the optimum position for a variable (assuming all other variables re- 
main fixed). Another idea would be to explore exact bounding techniques that 
determine when a search in a particular direction can be terminated. Also, ex- 
ploring completely different algorithms for OBDD minimization is a possibility. 

Finally, it would be interesting to determine bounds on the growth of the 
OBDD as a single variable is moved ±k positions. 

Notes 

1. Averages for the benchmark set are computed as the arithmetic mean of the ratio computed for each 
example. 

Abstract 

The problem of checking equality of Boolean functions can be solved successfully using existing 
techniques for only a limited range of examples. We extend the range by using a test generator 
and the divide and conquer paradigm. 



1. Introduction 

Since the synthesis process, whether automatic or manual, cannot be guaran- 
teed to be error-free, there is a need for checking its correctness. That is, we have 
to prove the equality F = G, where F is the input into synthesis and G is the re- 
sult of synthesis. The traditional approach to this problem is to build a canonical 
BDD representation [6] for each function and then simply compare the BDDs 
for identity. Many variations and improvements have been made [2, 7, 10, 14] 
so as to extend the range of applicability. The applicability of BDDs is lim- 
ited because their memory requirements may grow exponentially with the size 
of the function. Therefore “running out of memory” is the usual failure mode 
of BDDs. This is more serious than the failure mode of “running out of time” 
because it is much easier to allocate more time to a problem than more memory. 
This has been recognized by [9] where correctness can be guaranteed with a 
reliability increasing with increasing CPU time. 

One method of Boolean reasoning that does not explode in terms of memory 
is a test generator. The most straightforward approach towards proving F = G 
using a test generator would form the exclusive OR, F ⊕ G (Figure 1), and then 
ask whether the output of the XOR gate is testable. If it is testable for stuck 
at 0 then the resulting pattern constitutes a counter-example to the assertion of 
functional equivalence. If it is not testable for stuck at 0 then the two func- 
tions are identical. If it is not testable for stuck at 1 then the two functions are 
complements of each other. 
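
The miter check can be sketched in a few lines. Exhaustive enumeration of input patterns stands in here for the test generator, which is an assumption made only to keep the example self-contained; a real test generator searches for a pattern rather than enumerating all of them. The functions spec, impl and buggy are hypothetical examples.

```python
from itertools import product

def miter_counterexample(F, G, num_inputs):
    """Return an input pattern on which F xor G is 1, or None if none exists.

    A returned pattern plays the role of a test for "XOR output stuck at 0":
    it is a counter-example to F = G.
    """
    for pattern in product((0, 1), repeat=num_inputs):
        if F(*pattern) ^ G(*pattern):
            return pattern
    return None

def spec(a, b, c):
    return (a & b) | (a & c)

def impl(a, b, c):
    return a & (b | c)      # factored form of spec

def buggy(a, b, c):
    return a & (b ^ c)      # differs from spec when a = b = c = 1

print(miter_counterexample(spec, impl, 3))
print(miter_counterexample(spec, buggy, 3))
```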

Throughout the paper we will use the term “miter” to refer to the configu- 
ration of Figure 1. In general a miter can appear in the middle of larger logic, 
where it is defined to consist of a two-input XOR gate, plus the symmetric set 
difference between the transitive fanins of the two inputs into the XOR gate. In 
other words, a miter starts at an XOR gate and extends towards primary inputs 
until it reaches nets shared by both cones of logic. 




Figure 1. A miter. 



The above straightforward approach is not likely to be successful because test 
generators tend to have a lot of difficulty with miters. While there are test gener- 
ation strategies specifically designed for miters [8], they are bound to fail if the 
miter is big enough. The main observation of this paper is that the difficulty for 
a test generator is dependent mainly on the size of the miter, rather than on the 
size of the surrounding logic. The key to our approach is to keep applying a test 
generator to configurations where the size of the miter is small and independent 
of the size of the functions subject to verification. 

One way of making the verification problem easier is to do it in stages [5]. 
Suppose that f is a sub-function of F(f) and g is a sub-function of G(g). 
We would like to first prove f = g and then use it to simplify the proof of 
F(f) = G(g). To do that we must first solve two problems. 

1. It may be difficult to find such sub-functions f and g because synthesis frequently 
does not preserve the functionality of internal nodes. If f was optimized 
subject to the don’t cares of the whole F then there may be no g = f. 

2. Suppose that we do find f = g; how do we take advantage of it? We cannot 
simply replace f and g with a new variable y. Even if f = g and F(f) = G(g) it 
is possible that F(y) ≠ G(y) because F might have been optimized subject to the 
don’t cares of f. This problem is referred to as a “false negative”, that is, f = g 
and F(y) ≠ G(y) need not necessarily imply that F(f) ≠ G(g). 

Our solution to the first problem is not to ask whether f = g, i.e., “is f ⊕ g 
testable for stuck at 0?”. Instead we insert an XOR gate between f and all its 
immediate fanouts and make g the other input of the XOR gate (see Figure 2). 
Then we check whether the output of the XOR gate is testable (in the context of 
the surrounding logic of F). Suppose that it is not testable for stuck at 0. That 
means that for every input pattern either f = g, or f ≠ g but the difference 
cannot be propagated to primary outputs. In other words, g can replace f inside 
F without changing the functionality of F. Similarly, if the XOR gate is not 
testable for stuck at 1 then the complement of g can replace f. 

Figure 2. Can g replace f inside F? 

Figure 3. g replaces f. 

Our solution to the second problem is to actually perform the replacement. 
However, it is important that the logic of g not be copied inside F, but rather F 
and G must share the same copy of g (see Figure 3). Proving F(g) = G(g) is 
easier than proving F(f) = G(g) because the miter F(g) ⊕ G(g) is smaller than 
the miter F(f) ⊕ G(g). 

The whole process proceeds from inputs of F to its outputs. At each step we 
have a sub-function f and try to find a sub-function g of G that could replace f. 
If we find a suitable g we make the replacement, while retaining the functionality 
of F. As we proceed, F keeps resembling G more and more and the process 
stops at the outputs of F, which should be replaceable by the outputs of G. If an 
output of G cannot replace the corresponding output of F then the test generator 
gives an input pattern constituting a counter-example. 

2. Algorithm 

As described in the introduction, the approach is based on the following 

Lemma 1 Let x be a vector of variables and y be a single variable. Consider 
arbitrary functions f(x), g(x), F(x, y). (Since all three functions depend on x we 
will drop x from our notation; for example, by F(f) we will mean F(x, f(x)).) 
The following two identities hold: 

$$F(f) = F(g) \iff F(f \oplus g) = F(0) \tag{1}$$ 

$$F(f) = F(\overline{g}) \iff F(f \oplus g) = F(1) \tag{2}$$ 

Proof: 

We will be done if we show that (1) and (2) are true for any input pattern x. 
For a given input x there are four possible combinations of values for f and g. 
For illustration we will show the case f = 0 and g = 1; all the other cases are 
similar. Substituting f = 0, g = 1 
in (1) we get $F(0) = F(1) \iff F(1) = F(0)$ 
in (2) we get $F(0) = F(0) \iff F(1) = F(1)$ 

Q.E.D. 

The right hand side of (1) says that the output of the XOR is not testable 
for stuck at 0 and the left hand side says that in this case we may replace f by 
g. Similarly (2) says that if the XOR is not testable for stuck at 1 then we can 
replace f by the complement of g. 
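
Because the lemma is a statement about finitely many Boolean values, it can be checked by brute force on a small instance. The sketch below takes x to be a single bit and enumerates every F(x, y), f(x) and g(x). It checks identity (1) in the form F(f) = F(g) iff F(f xor g) = F(0), and identity (2) in the form F(f) = F(complement of g) iff F(f xor g) = F(1), which is the reading consistent with the proof's substitution. The helper names are assumptions made for this illustration.

```python
from itertools import product

def all_functions(num_inputs):
    """Yield every Boolean function of num_inputs bits as a lookup dict."""
    points = list(product((0, 1), repeat=num_inputs))
    for outputs in product((0, 1), repeat=len(points)):
        yield dict(zip(points, outputs))

def lemma_holds():
    """Check identities (1) and (2) pointwise for every F, f, g and x."""
    for F in all_functions(2):              # F(x, y)
        for f in all_functions(1):          # f(x)
            for g in all_functions(1):      # g(x)
                for x in (0, 1):
                    fx, gx = f[(x,)], g[(x,)]
                    xor = fx ^ gx
                    lhs1 = F[(x, fx)] == F[(x, gx)]
                    rhs1 = F[(x, xor)] == F[(x, 0)]
                    lhs2 = F[(x, fx)] == F[(x, 1 - gx)]
                    rhs2 = F[(x, xor)] == F[(x, 1)]
                    if lhs1 != rhs1 or lhs2 != rhs2:
                        return False
    return True

print(lemma_holds())
```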

We now give the algorithm for proving F = G, where both functions are 
given by multi-output combinational networks. The algorithm is followed by a 
detailed explanation. 

0. for each net calculate its primary inputs 

1. form a list fs of nets in F in topological order 

2. for each f in fs 

3.     form a list gs of some nets in G 

4.     for each g in gs 

5.         if F(f ⊕ g) = F(0) then replace f by g 

6.         if F(f ⊕ g) = F(1) then replace f by the complement of g 

For each of the above statements we will give an explanation as well as its 
computational complexity. Let the number of connections in both F and G be n. 
We will assume that the number of nets is also of the same order as n. 

Statement 0: calculates information used for a heuristic selection of candidates 
g in statement 3. It collects information whose size can be O(n^2) in the 
worst case. Therefore its time and space complexity is O(n^2). 

Statement 1: The only nets required in fs are primary outputs; all other nets 
are placed there only for speed. The more internal nets are included in fs, the 
more frequently we need to invoke the test generator, but the easier will be the 
test generator’s work because it will be working on smaller miters. 

Our heuristic is to have the nets in fs separated by approximately two stages 
of logic. In our experiments the number of stages of separation did not signifi- 
cantly affect the performance of the algorithm, provided for any miter that is too 
large we consider smaller miters in its place. The topological ordering is done 
in linear time. 

Statement 2: There can be at most n of the nets f. 

Statement 3: In contrast to fs the choice of gs has a tremendous impact on 
performance. If the list gs were too long then we would call the test generator 
too many times even for nets f that cannot be replaced by any g. On the other 
hand, if gs were too short then we might fail to find a g to replace a particular f, 
which would make the test generator’s job harder later on. Our heuristic forms a 
list gs giving preference to nets g with a name related to the name of f, followed 
by nets that have the same simulation result as f on a small number of random 
patterns, followed by all remaining nets. 

The list gs is then pruned in order to minimize the number of calls to the test 
generator: 

- We require that each g depend on no more primary inputs than f does. 

- We use approximate fault simulation to discard candidates g which are not 
likely to replace f (similar to [1]). 

Then the list gs is truncated after k elements; in our experiments we used k = 20. 
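
The ordering and truncation of gs can be sketched as follows. Several details are assumptions made for illustration: "a name related to the name of f" is reduced to an exact name match, "no more primary inputs than f" is read as a subset test on the support sets, nets are modelled as Python callables over an input tuple, and the approximate fault simulation step is left out.

```python
import random

def signature(net, patterns):
    """Simulation result of one net on a list of input patterns."""
    return tuple(net(p) for p in patterns)

def candidates(f_name, f_net, f_support, g_nets, patterns, k=20):
    """Order the nets of G as candidates to replace f and keep the first k.

    g_nets maps a net name to (callable, set of primary-input indices).
    """
    f_sig = signature(f_net, patterns)

    def rank(name):
        net, _ = g_nets[name]
        if name == f_name:
            return 0                      # related name comes first
        if signature(net, patterns) == f_sig:
            return 1                      # same random-simulation result
        return 2                          # all remaining nets

    # Prune: g may not depend on primary inputs that f does not depend on.
    eligible = [name for name, (_, support) in g_nets.items()
                if support <= f_support]
    return sorted(eligible, key=rank)[:k]  # sorted() is stable

rng = random.Random(0)
patterns = [tuple(rng.randint(0, 1) for _ in range(3)) for _ in range(8)]

f_net = lambda p: p[0] & p[1]
g_nets = {
    "n3": (lambda p: 1 - ((1 - p[0]) | (1 - p[1])), {0, 1}),  # same function as f
    "n7": (lambda p: p[0] | p[1], {0, 1}),
    "n9": (lambda p: p[0] & p[2], {0, 2}),                    # pruned: uses input 2
    "f": (lambda p: p[1] & p[0], {0, 1}),
}
gs = candidates("f", f_net, {0, 1}, g_nets, patterns)
print(gs)
```

Matching signatures only make a net a candidate; the test generator still has to confirm that the replacement is valid.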

Statement 3 has a worst case complexity O(n) because we may be forced 
to consider all n nets g. Since this statement is executed n times it contributes 
O(n^2) to the overall algorithm. 

Statements 5 and 6: We call the test generator [11, 12] to determine whether g 
can replace f. If so, we make the replacement and terminate the loop over gs. This 
statement contributes only O(n) complexity to the algorithm because we allow 
the test generator to run only a fixed amount of time independent of the size of 
the network. For each f the test generator is called at most k times. 

Thus the overall worst case complexity is O(n^2), but we cannot guarantee 
to verify every design. In order to obtain such a guarantee we would have to 
allow an unbounded amount of test generation time, in which case the worst 
case complexity would depend exponentially on the differences between the 
two designs. 

The fact that in our experiments we did not experience an exponential blowup 
suggests that in synthesized logic it is normally possible to keep the miter size 
constant and small. This does not imply optimization based on local transfor- 
mations only; the test generator does take into account global information far 
away from the miter under consideration. 

3. Experimental results 



Table 1. Verification of ISCAS benchmarks. 

CIRCUIT    CONNECTS   CONNECTS   CPU TIME   RELATIVE 
           BEFORE     AFTER      [sec]      TIME 
C17            31         19          1        20 
C432          712        346          4         4 
C499         1105        539         38        24 
C880         1125        609          5         3 
C1355        1937        903          9         3 
C1908        2244        647         22         9 
C2670        2783       1211         58        15 
C3540        3666       1652         39         8 
C5315        5346       2880         29         4 
C6288        9120       4690        193        13 
C7552        8107       3288        136        12 
S27            26         25          1        12 
S208          187        115          1         3 
S298          267        175          1         2 
S344          304        211          1         2 
S349          308        211          1         2 
S382          336        217          1         2 
S386          367        193          1         2 
S400          350        214          1         2 
S420          373        231          5         7 
S444          382        217          2         3 
S510          456        409          3         3 
S526         1475        309          2         2 
S526N         475        288          2         2 
S641          617        310          3         3 
S713          668        309          3         3 
S820          799        485          5         4 
S832          811        485         11         8 
S838          739        419         47        35 
S1196        1055        849          8         3 
S1238        1087        853         10         4 
S1423        1260        815          8         3 
S1488        1420        947         11         5 
S1494        1426        939         10         4 
S5378        4475       2096         30         4 
S9234        8240       2903         77         6 
S15850      14343       5616        192         8 
S35932      30352      17954        519         8 
S38417      33798      15187        614         9 
S38584      34498      19669        736         9 

Table 1 shows results on the ISCAS benchmarks [3, 4]. Each benchmark 
was synthesized using the standard script of BooleDozer [13], which includes 
data flow optimization, redundancy removal, factoring, technology mapping and 
others. We did not run any timing optimization because it is very dependent on 
timing assertions and we do not anticipate that it would cause problems for our 
algorithm. On the other hand, to the standard script we added a transformation 
that collapses as large pieces of logic as possible into a two-level representation, 
minimizes and factors. We added this transformation because it destroys the 
original structure of the design and therefore makes it difficult to find an internal 
net g in the implementation that can replace a given net f in the specification. 

For each design we give the number of connections of the two designs being 
compared - before and after synthesis. Then we give the CPU time in seconds 
on IBM RS/6000 model 550. In the last column we give a ratio between the ver- 
ification time and the time to read in the two designs, because this is a measure 
independent of the hardware and other system variables that are not pertinent to 
our algorithm. (A linear algorithm would have a constant relative time.) 

4. Conclusions 

We have shown how a test generator can be used for verification, and demon- 
strated its effectiveness by verifying all the ISCAS benchmarks, which to our 
knowledge has not been accomplished by any other approach yet. 

The algorithm proved successful because there are apparently many nets in 
a specification for which there exists a corresponding net in the implementa- 
tion. This appears true even after extensive global optimizations. It is not very 
surprising, given the experience of transduction [15], which relies on a related 
phenomenon. Namely, it is very common that in a logic network there are nets 
computing related functions possibly for unrelated reasons. 

It is possible, however, that our algorithm may fail on some designs in the 
future. Even then it would not be necessary for a designer to partition his design 
(which is a common industry practice) because our algorithm is not as sensitive 
to design size as it is to the differences between the two designs being compared. 
Therefore any partitioning, if necessary, should be in the synthesis process. That 
is, we can write several checkpoint files during synthesis and do verification 
between successive pairs of checkpoint files. 

Abstract 

This paper introduces GRASP (Generic seaRch Algorithm for the Satisfiability Problem), an in- 
tegrated algorithmic framework for SAT that unifies several previously proposed search-pruning 
techniques and facilitates identification of additional ones. GRASP is premised on the inevitabil- 
ity of conflicts during search and its most distinguishing feature is the augmentation of basic 
backtracking search with a powerful conflict analysis procedure. Analyzing conflicts to deter- 
mine their causes enables GRASP to backtrack non-chronologically to earlier levels in the search 
tree, potentially pruning large portions of the search space. In addition, by “recording” the causes 
of conflicts, GRASP can recognize and preempt the occurrence of similar conflicts later on in the 
search. Finally, straightforward bookkeeping of the causality chains leading up to conflicts allows 
GRASP to identify assignments that are necessary for a solution to be found. Experimental results 
obtained from a large number of benchmarks, including many from the field of test pattern gen- 
eration, indicate that application of the proposed conflict analysis techniques to SAT algorithms 
can be extremely effective for a large number of representative classes of SAT instances. 



1. Introduction 

The Boolean satisfiability problem (SAT) appears in many contexts in the 
field of computer-aided design of integrated circuits including automatic test 
pattern generation (ATPG), timing analysis, delay fault testing, and logic verifi- 
cation, to name just a few. Though well-researched and widely investigated, it 
remains the focus of continuing interest because efficient techniques for its solu- 
tion can have great impact. SAT belongs to the class of NP-complete problems 
whose algorithmic solutions are currently believed to have exponential worst 
case complexity [6]. Over the years, many algorithmic solutions have been pro- 
posed for SAT, the most well known being the different variations of the Davis- 
Putnam procedure [3]. The best known version of this procedure is based on a 
backtracking search algorithm that, at each node in the search tree, elects an 
assignment and prunes subsequent search by iteratively applying the unit clause 
and the pure literal rules [18]. Iterated application of the unit clause rule is commonly 
referred to as Boolean Constraint Propagation (BCP) or as derivation of 
implications in the electronic CAD literature [1]. 

Most of the recently proposed improvements to the basic Davis-Putnam pro- 
cedure [5, 10, 17, 18] can be distinguished based on their decision making 
heuristics or their use of preprocessing or relaxation techniques. Common to all 
these approaches, however, is the chronological nature of backtracking. Never- 
theless, non-chronological backtracking techniques have been extensively stud- 
ied and applied to different areas of Artificial Intelligence, particularly Truth 
Maintenance Systems (TMS), Constraint Satisfaction Problems (CSP) and Au- 
tomated Deduction, in some cases with very promising experimental results. 
(Bibliographic references to the work in these areas can be found in [15].) 

Interest in the direct application of SAT algorithms to electronic design au- 
tomation (EDA) problems has been on the rise recently [2, 10, 17]. In addi- 
tion, improvements to the traditional structural (path sensitization) algorithms 
for some EDA problems, such as ATPG, include search-pruning techniques that 
are also applicable to SAT algorithms in general [8, 9, 13]. The main purpose 
of this paper is to introduce a procedure for the analysis of conflicts in search 
algorithms for SAT. Even though the conflict analysis procedure is described 
in the context of SAT, it can be naturally extended to EDA-specific algorithms, 
thus complementing other well-known search-pruning techniques [2, 9]. 

The proposed conflict analysis procedure has been incorporated in GRASP 
(Generic seaRch Algorithm for the Satisfiability Problem), an integrated al- 
gorithmic framework for SAT. Several features distinguish the conflict analysis 
procedure in GRASP from others used in TMSs and CSPs. First, conflict anal- 
ysis in GRASP is tightly coupled with BCP and the causes of conflicts need 
not necessarily correspond to decision assignments. Second, clauses can be 
added to the original set of clauses, and the number and size of added clauses 
is user-controlled. This is in explicit contrast with nogood recording techniques 
developed for TMSs and CSPs. Third, GRASP employs techniques to prune the 
search by analyzing the implication structure generated by BCP. Exploiting the 
“anatomy” of conflicts in this manner has no equivalent in other areas. 

Some of the proposed techniques have also been applied in several structural 
ATPG algorithms [8, 16], among others. The GRASP framework, however, per- 
mits a unified representation of all known search-pruning methods and potenti- 
ates the identification of additional ones. The basic SAT algorithm in GRASP 
is also customizable to take advantage of application-specific characteristics to 
achieve additional efficiencies [13]. Finally, the framework is organized to allow 
easy adaptation of other algorithmic techniques, such as those in [2, 9], whose 
operation is orthogonal to those described here. 

The remainder of this paper is organized in four sections. In Section 2, we 
introduce the basics of backtracking search, particularly our implementation of 
BCP, and describe the overall architecture of GRASP. This is followed, in Section 3, 
by a detailed discussion of the procedures for conflict analysis and how 
they are implemented. Extensive experimental results on a wide range of bench- 
marks, including many from the field of ATPG, are presented and analyzed in 
Section 4. In particular, GRASP is shown to outperform two recent state-of- 
the-art SAT algorithms [5, 17] on most, but not all, benchmarks. The paper 
concludes in Section 5 with some suggestions for further research. 

2. Definitions 

2.1 Basic Definitions and Notation 

A conjunctive normal form (CNF) formula φ on n binary variables x_1, ..., x_n 
is the conjunction (AND) of m clauses ω_1, ..., ω_m, each of which is the disjunction 
(OR) of one or more literals, where a literal is the occurrence of a variable 
or its complement. A formula φ denotes a unique n-variable Boolean function 
f(x_1, ..., x_n) and each of its clauses corresponds to an implicate of f. Clearly, a 
function f can be represented by many equivalent CNF formulas. A formula is 
complete if it consists of the entire set of prime implicates for the corresponding 
function. In general, a complete formula will have an exponential number 
of clauses. We will refer to a CNF formula as a clause database and use “formula,” 
“CNF formula,” and “clause database” interchangeably. The satisfiability 
problem (SAT) is concerned with finding an assignment to the arguments of 
f(x_1, ..., x_n) that makes the function equal to 1 or proving that the function is 
equal to the constant 0. 

A backtracking search algorithm for SAT is implemented by a search process 
that implicitly traverses the space of 2^n possible binary assignments to the 
problem variables. During the search, a variable whose binary value has already 
been determined is considered to be assigned; otherwise it is unassigned with 
an implicit value of X = {0, 1}. A truth assignment for a formula φ is a set 
of assigned variables and their corresponding binary values. It will be convenient 
to represent such assignments as sets of variable/value pairs; for example 
A = {(x_1, 0), (x_7, 1), (x_13, 0)}. Alternatively, assignments can be denoted as 
A = {(x_1 = 0), (x_7 = 1), (x_13 = 0)}. Sometimes it is convenient to indicate that a 
variable x is assigned without specifying its actual value. In such cases, we will 
use the notation v(x) to denote the binary value assigned to x. An assignment A 
is complete if |A| = n; otherwise it is partial. Evaluating a formula φ for a given 
truth assignment A yields three possible outcomes: φ|_A = 1 and we say that φ 
is satisfied and refer to A as a satisfying assignment; φ|_A = 0 in which case φ 
is unsatisfied and A is referred to as an unsatisfying assignment; and φ|_A = X 
indicating that the value of φ cannot be resolved by the assignment. This last 
case can only happen when A is a partial assignment. An assignment partitions 
the clauses of φ into three sets: satisfied clauses (evaluating to 1); unsatisfied 
clauses (evaluating to 0); and unresolved clauses (evaluating to X). The unassigned 
literals of a clause are referred to as its free literals. A clause is said to 
be unit if the number of its free literals is one. 
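
These definitions translate directly into a three-valued evaluation routine. In the sketch below, clauses are lists of signed integers (a positive integer v stands for x_v and -v for its complement), which is a representation chosen for this illustration rather than one taken from the paper. The unit test additionally requires the clause to be unresolved, an assumption that keeps already satisfied clauses from being treated as unit.

```python
X = "X"  # value of a clause or formula that the assignment cannot resolve

def eval_clause(clause, assignment):
    """Return 1, 0 or X for a clause under a (possibly partial) assignment.

    assignment maps a variable index to 0 or 1.
    """
    unresolved = False
    for lit in clause:
        var = abs(lit)
        if var not in assignment:
            unresolved = True
        elif assignment[var] == (1 if lit > 0 else 0):
            return 1                     # one true literal satisfies the clause
    return X if unresolved else 0

def eval_formula(clauses, assignment):
    """Return 1 (satisfied), 0 (unsatisfied) or X (unresolved)."""
    values = [eval_clause(c, assignment) for c in clauses]
    if 0 in values:
        return 0
    if X in values:
        return X
    return 1

def free_literals(clause, assignment):
    """Literals of the clause whose variable is still unassigned."""
    return [lit for lit in clause if abs(lit) not in assignment]

def is_unit(clause, assignment):
    """Unresolved clause with exactly one free literal."""
    return (eval_clause(clause, assignment) == X
            and len(free_literals(clause, assignment)) == 1)

# (x1 + x2') (x2 + x3) (x1' + x3')
clauses = [[1, -2], [2, 3], [-1, -3]]
print(eval_formula(clauses, {1: 0}))
print(eval_formula(clauses, {1: 0, 2: 1}))
print(eval_formula(clauses, {1: 0, 2: 0, 3: 1}))
```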

2.2 Formula Satisfiability 

Formula satisfiability is concerned with determining if a given formula φ is 
satisfiable and with identifying a satisfying assignment for it. Starting from an 
empty truth assignment, a backtrack search algorithm traverses the space of truth 
assignments implicitly and organizes the search for a satisfying assignment by 
maintaining a decision tree. Each node in the decision tree specifies an elective 
assignment to an unassigned variable; such assignments are referred to as deci- 
sion assignments. A decision level is associated with each decision assignment 
to denote its depth in the decision tree; the first decision assignment at the root 
of the tree is at decision level 1. The search process iterates through the steps 
of: 

1 Extending the current assignment by making a decision assignment to an 
unassigned variable. This decision process is the basic mechanism for 
exploring new regions of the search space. The search terminates suc- 
cessfully if all clauses become satisfied; it terminates unsuccessfully if 
some clauses remain unsatisfied and all possible assignments have been 
exhausted. 

2 Extending the current assignment by following the logical consequences
of the assignments made thus far. The additional assignments derived
by this deduction process are referred to as implication assignments or,
more simply, implications. The deduction process may also lead to the
identification of one or more unsatisfied clauses, implying that the current
assignment is not a satisfying assignment. Such an occurrence is referred
to as a conflict, and the associated unsatisfying assignments are called
conflicting assignments.

3 Undoing the current assignment, if it is conflicting, so that another as- 
signment can be tried. This backtracking process is the basic mechanism 
for retreating from regions of the search space that do not correspond to 
satisfying assignments. 

The decision level at which a given variable x is either electively assigned or
forcibly implied will be denoted by δ(x). When relevant to the context, the
assignment notation introduced earlier may be extended to indicate the decision
level at which the assignment occurred. Thus, x = v@d would be read as “x
becomes equal to v at decision level d.”

The average complexity of the above search process depends on how decisions,
deductions, and backtracking are made. It also depends on the formula itself.
The implications that can be derived from a given partial assignment depend on
the set of available clauses. In general, a formula consisting of more clauses
will enable more implications to be derived and will reduce the number of
backtracks due to conflicts. The limiting case is the complete formula that
contains all prime implicates. For such a formula no conflicts can arise since
all logical implications for a partial assignment can be derived. This, however,
may not lead to shorter execution times since the size of such a formula may be
exponential.

2.3 Function Satisfiability 

Given an initial formula φ, many search systems attempt to augment it with
additional implicates to increase the deductive power during the search process.
This is usually referred to as “learning” [12] and can be performed either as
a preprocessing step (static learning) or during the search (dynamic learning).
Even though learning as defined in [10, 12] only yields implicates of size 2
(i.e. non-local implications), the concept can be readily extended to
implicates of arbitrary size.

Our approach can be classified as a dynamic learning search mechanism 
based on diagnosing the causes of conflicts. It considers the occurrence of a 
conflict, which is unavoidable for an unsatisfiable instance unless the formula is 
complete, as an opportunity to “learn from the mistake that led to the conflict” 
and introduces additional implicates to the clause database only when it stum- 
bles. Conflict diagnosis produces three distinct pieces of information that can 
help speed up the search: 

1 New implicates that did not exist in the clause database and that can be 
identified with the occurrence of the conflict. These clauses may be added 
to the clause database to avert future occurrence of the same conflict and 
represent a form of conflict-based equivalence (CBE). 

2 An indication of whether the conflict was ultimately due to the most recent 
decision assignment or to an earlier decision assignment. 

■ If that assignment was the most recent (i.e. at the current decision
level), the opposite assignment (if it has not been tried) is immediately
implied as a necessary consequence of the conflict; we refer to
this as a failure-driven assertion (FDA).

■ If the conflict resulted from an earlier decision assignment (at a
lower decision level), the search can backtrack to the corresponding
level in the decision tree since the subtree rooted at that level
corresponds to assignments that will yield the same conflict. The
ability to identify a backtracking level that is much earlier than the
current decision level is a form of non-chronological backtracking
that we refer to as conflict-directed backtracking (CDB), and has
the potential of significantly reducing the amount of search.

Current Truth Assignment: {x9 = 0@1, x10 = 0@3, x11 = 0@3, x12 = 1@2, x13 = …}

Current Decision Assignment: {x1 = 1@6}

ω1 = (¬x1 + x2)
ω2 = (¬x1 + x3 + x9)
ω3 = (¬x2 + ¬x3 + x4)
ω4 = (¬x4 + x5 + x10)
ω5 = (¬x4 + x6 + x11)
ω6 = (¬x5 + ¬x6)
ω7 = (x1 + x7 + ¬x12)
ω8 = (x1 + x8)
ω9 = (¬x7 + ¬x8 + ¬x13)

Figure 1.

These conflict diagnosis techniques are discussed further in Section 3. 

2.4 Structure of the Search Process 

The basic mechanism for deriving implications from a given clause database
is Boolean constraint propagation (BCP) [5, 18]. Consider a formula φ
containing the clause ω = (x + ¬y) and assume y = 1. For any satisfying
assignment to φ, ω requires that x be equal to 1, and we say that y = 1 implies
x = 1 due to ω. In general, given a unit clause (l1 + … + lk) of φ with free
literal lj, consistency requires lj = 1 since this represents the only
possibility for the clause to be satisfied. If lj = x, then the assignment
x = 1 is required; if lj = ¬x then x = 0 is required. Such assignments are
referred to as logical implications (implications, for short) and correspond to
the application of the unit clause rule proposed by M. Davis and H. Putnam [3].
BCP refers to the iterated application of this rule to a clause database until
the set of unit clauses becomes empty or one or more clauses become
unsatisfied.
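
As an illustration of the iterated unit clause rule, the following minimal
sketch repeatedly scans a clause list until no unit clause remains or some
clause becomes unsatisfied. The integer literal encoding (+i for x_i, -i for
¬x_i) and the function name bcp are assumptions of this example; the repeated
full scan is chosen for brevity and is not meant to reflect how the deduction
engine of GRASP is implemented.

```python
from typing import Dict, List, Optional, Tuple

Clause = List[int]           # assumed encoding: +i is x_i, -i is ¬x_i
Assignment = Dict[int, int]  # variable index -> 0 or 1


def bcp(clauses: List[Clause],
        assignment: Assignment) -> Tuple[Assignment, Optional[Clause]]:
    """Apply the unit clause rule until a fixed point or a conflict.

    Returns the extended assignment and the first clause found unsatisfied
    (None if no conflict was detected).
    """
    assignment = dict(assignment)  # do not modify the caller's assignment
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            satisfied = False
            free = []
            for lit in clause:
                value = assignment.get(abs(lit))
                if value is None:
                    free.append(lit)
                elif (value == 1) == (lit > 0):
                    satisfied = True
                    break
            if satisfied:
                continue
            if not free:
                return assignment, clause      # all literals are 0: conflict
            if len(free) == 1:                 # unit clause: imply free literal
                lit = free[0]
                assignment[abs(lit)] = 1 if lit > 0 else 0
                changed = True
    return assignment, None
```

With x encoded as variable 1 and y as variable 2, calling
bcp([[1, -2]], {2: 1}) reproduces the example in the text: y = 1 implies x = 1
due to the clause (x + ¬y).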

Let the assignment of a variable x be implied due to a clause
ω = (l1 + … + lk). The antecedent assignment of x, denoted as A(x), is defined
as the set of assignments to variables other than x with literals in ω.
Intuitively, A(x) designates those variable assignments that are directly
responsible for implying the assignment of x due to ω. For example, the
antecedent assignments of x, y and z due to the clause ω = (x + y + ¬z) are,
respectively, A(x) = {y = 0, z = 1}, A(y) = {x = 0, z = 1}, and
A(z) = {x = 0, y = 0}. Note that the antecedent assignment of a decision
variable is empty.

The sequence of implications generated by BCP is captured by a directed
implication graph I defined as follows (see Figure 1):

1 Each vertex in I corresponds to a variable assignment x = v(x).

2 The predecessors of vertex x = v(x) in I are the antecedent assignments
A(x) corresponding to the unit clause ω that led to the implication of x.
The directed edges from the vertices in A(x) to vertex x = v(x) are all
labeled with ω. Vertices that have no predecessors correspond to decision
assignments.

3 Special conflict vertices are added to I to indicate the occurrence of
conflicts. The predecessors of a conflict vertex κ correspond to variable
assignments that force a clause ω to become unsatisfied and are viewed as
the antecedent assignment A(κ). The directed edges from the vertices in
A(κ) to κ are all labeled with ω.

The decision level of an implied variable x is related to those of its antecedent
variables according to:

$$\delta(x) = \max \{\, \delta(y) \mid (y, v(y)) \in A(x) \,\} \qquad (1)$$
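
The antecedent assignment of an implied variable and its decision level
according to (1) can be computed as in the sketch below. The literal encoding
(+i for x_i, -i for ¬x_i) and the function names are assumptions made for
illustration only.

```python
from typing import Dict, List

Clause = List[int]  # assumed encoding: +i is x_i, -i is ¬x_i


def antecedent(clause: Clause, implied_var: int) -> Dict[int, int]:
    """Antecedent assignment A(x): every other literal of the clause must be 0.

    A positive literal y is 0 when y = 0; a negative literal ¬y is 0 when y = 1.
    """
    return {abs(lit): (0 if lit > 0 else 1)
            for lit in clause if abs(lit) != implied_var}


def implied_level(antecedent_assignment: Dict[int, int],
                  level: Dict[int, int]) -> int:
    """Decision level of an implied variable, following equation (1)."""
    return max(level[y] for y in antecedent_assignment)
```

With x, y and z encoded as variables 1, 2 and 3, antecedent([1, 2, -3], 1)
gives {2: 0, 3: 1}, matching A(x) = {y = 0, z = 1} in the text. Note that (1)
applies only to implied variables; a decision variable has an empty antecedent
assignment and its level is set by the decision itself.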

2.5 Search Algorithm Template 

The general structure of the GRASP search algorithm is shown in Figure
2. We assume that an initial clause database φ and an initial assignment A, at
decision level 0, are given. This initial assignment, which may be empty, may be
viewed as an additional problem constraint and causes the search to be restricted
to a subcube of the n-dimensional Boolean space. As the search proceeds, both
φ and A are modified. The recursive search procedure consists of four major
operations:

1 Decide(), which chooses a decision assignment at each stage of the search
process. Decision procedures are commonly based on heuristic knowledge.
For the results given in Section 4, the following greedy heuristic is
used:

At each node in the decision tree evaluate the number of clauses
directly satisfied by each assignment to each variable. Choose
the variable and the assignment that directly satisfy the largest
number of clauses.

Other decision making procedures have been incorporated in GRASP, as
described in [15].

2 Deduce(), which implements BCP and (implicitly) maintains the resulting
implication graph. (See [15] for the details of Deduce().)

3 Diagnose(), which identifies the causes of conflicts and can augment the
clause database with additional implicates. Realization of different
conflict diagnosis procedures is the subject of Section 3.

4 Erase(), which deletes the assignments at the current decision level.

    // Global variables:     Clause database φ
    //                       Variable assignment A
    // Return value:         FAILURE or SUCCESS
    // Auxiliary variables:  Backtracking level β
    //
    GRASP()
    {
        return (Search (0, β) != SUCCESS) ? FAILURE : SUCCESS;
    }

    //
    // Input argument:       Current decision level d
    // Output argument:      Backtracking level β
    // Return value:         CONFLICT or SUCCESS
    //
    Search (d, &β)
    {
        if (Decide (d) == SUCCESS)
            return SUCCESS;
        while (TRUE) {
            if (Deduce (d) != CONFLICT) {
                if (Search (d + 1, β) == SUCCESS) { return SUCCESS; }
                else if (β != d) { Erase(); return CONFLICT; }
            }
            if (Diagnose (d, β) == CONFLICT) { Erase(); return CONFLICT; }
            Erase();
        }
    }

    //
    Diagnose (d, &β)
    {
        ωc(κ) = Conflict_Induced_Clause();      // From (4)
        Update_Clause_Database (ωc(κ));
        β = Compute_Max_Level();                // From (7)
        if (β != d) {
            add new conflict vertex κ to I;
            record A(κ);
            return CONFLICT;
        }
        return SUCCESS;
    }

Figure 2. Description of GRASP.


We refer to Decide(), Deduce() and Diagnose() as the Decision, Deduction
and Diagnosis engines, respectively. Different realizations of these engines lead
to different SAT algorithms. For example, the Davis-Putnam procedure can be
emulated with the above algorithm by defining a decision engine, requiring the
deduction engine to implement BCP and the pure literal rule, and organizing the
diagnosis engine to implement chronological backtracking.
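
The greedy heuristic quoted for Decide() can be sketched as follows. This is
one possible reading, not the GRASP implementation: we assume that only clauses
not yet satisfied are counted, that ties are broken in favour of the lowest
variable index and then the value 1, and that literals are encoded as integers
(+i for x_i, -i for ¬x_i). The function name greedy_decision is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Clause = List[int]           # assumed encoding: +i is x_i, -i is ¬x_i
Assignment = Dict[int, int]  # variable index -> 0 or 1


def greedy_decision(clauses: List[Clause],
                    assignment: Assignment) -> Optional[Tuple[int, int]]:
    """Pick the (variable, value) pair that directly satisfies most clauses.

    Returns None when no unassigned variable occurs in a clause that is not
    yet satisfied, i.e. when there is nothing left to decide.
    """
    counts: Dict[Tuple[int, int], int] = {}
    for clause in clauses:
        # Skip clauses that the current assignment already satisfies.
        if any(assignment.get(abs(lit)) == (1 if lit > 0 else 0)
               for lit in clause):
            continue
        for lit in set(clause):
            if abs(lit) not in assignment:
                key = (abs(lit), 1 if lit > 0 else 0)
                counts[key] = counts.get(key, 0) + 1
    if not counts:
        return None
    # Assumed tie-break: highest count, then lowest variable index, then value 1.
    return max(counts, key=lambda k: (counts[k], -k[0], k[1]))
```

For the clauses (x1 + x2), (x1 + ¬x3), (¬x1 + x3) and (x1 + x4) with an empty
assignment, the sketch selects x1 = 1, which directly satisfies three clauses.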

3. Conflict Analysis Procedures 

When a conflict arises during BCP, the structure of the implication sequence
converging on a conflict vertex κ is analyzed to determine those (unsatisfying)
variable assignments that are directly responsible for the conflict. The
conjunction of these conflicting assignments is an implicant that represents a
sufficient condition for the conflict to arise. Negation of this implicant,
therefore, yields an implicate of the Boolean function f (whose satisfiability
we seek) that does not exist in the clause database φ. This new implicate,
referred to as a conflict-induced clause, provides the primary mechanism for
implementing failure-driven assertions, non-chronological conflict-directed
backtracking, and conflict-based equivalence (see Section 2.3). In TMS [16] and
in some algorithms for CSP [11], “nogoods” provide conditions similar to
conflict-induced clauses. Nevertheless, the basic mechanism for creating
conflict-induced clauses differs.

We denote the conflicting assignment associated with a conflict vertex κ by
Ac(κ) and the associated conflict-induced clause by ωc(κ). The conflicting
assignment is determined by a backward traversal of the implication graph
starting at κ. Besides the decision assignment at the current decision level,
only those assignments that occurred at previous decision levels are included
in Ac(κ). This is justified by the fact that the decision assignment at the
current decision level is directly responsible for all implied assignments at
that level. Thus, along with assignments from previous levels, the decision
assignment at the current decision level is a sufficient condition for the
conflict. To facilitate the computation of Ac(κ) we partition the antecedent
assignments of κ as well as those for variables assigned at the current
decision level into two sets. Let x denote either κ or a variable that is
assigned at the current decision level. The partition of A(x) is then given by:

$$\Lambda(x) = \{\, (y, v(y)) \in A(x) \mid \delta(y) < \delta(x) \,\}$$
$$\Sigma(x) = \{\, (y, v(y)) \in A(x) \mid \delta(y) = \delta(x) \,\} \qquad (2)$$

For example, referring to the implication graph of Figure 1,
Λ(x6) = {x11 = 0@3} and Σ(x6) = {x4 = 1@6}. Determination of the conflicting
assignment Ac(κ) can now be computed using the following recursive definition:

$$A_c(x) = \begin{cases} \{ (x, v(x)) \} & \text{if } A(x) = \emptyset \\ \Lambda(x) \cup \Big[ \bigcup_{(y, v(y)) \in \Sigma(x)} A_c(y) \Big] & \text{otherwise} \end{cases} \qquad (3)$$

and starting with x = κ. The conflict-induced clause corresponding to Ac(κ) is
now determined according to:

$$\omega_c(\kappa) = \sum_{(x, v(x)) \in A_c(\kappa)} x^{v(x)} \qquad (4)$$

where, for a binary variable x, x^0 = x and x^1 = ¬x. Application of (2)-(4) to
the conflict depicted in Figure 1 yields the following conflicting assignment and
conflict-induced clause at decision level 6:

$$A_c(\kappa) = \{ x_1 = 1,\; x_9 = 0,\; x_{10} = 0,\; x_{11} = 0 \}$$
$$\omega_c(\kappa) = (\neg x_1 + x_9 + x_{10} + x_{11}) \qquad (5)$$
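
The recursive definition can be traced on the running example with the sketch
below. The implication graph is entered by hand as we read it from the Figure 1
example (decision x1 = 1@6, with x2 through x6 implied at level 6 and the
conflict reached through x5 and x6); that structure, the index 0 used for the
conflict vertex κ, the integer literal encoding and the function names are all
assumptions of this illustration.

```python
from typing import Dict, List, Set, Tuple

KAPPA = 0  # hypothetical index standing for the conflict vertex κ

# Assumed reading of the Figure 1 example: values, decision levels and
# antecedent variables of each vertex in the implication graph.
value: Dict[int, int] = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 9: 0, 10: 0, 11: 0}
level: Dict[int, int] = {KAPPA: 6, 1: 6, 2: 6, 3: 6, 4: 6, 5: 6, 6: 6,
                         9: 1, 10: 3, 11: 3}
antecedents: Dict[int, List[int]] = {
    KAPPA: [5, 6], 6: [4, 11], 5: [4, 10], 4: [2, 3], 3: [1, 9], 2: [1], 1: [],
}


def conflicting_assignment(x: int) -> Set[Tuple[int, int]]:
    """Recursive computation of the conflicting assignment, after (2) and (3)."""
    preds = antecedents.get(x, [])
    if not preds:                       # decision variable: A(x) is empty
        return {(x, value[x])}
    # Λ(x): antecedents assigned at earlier decision levels.
    result = {(y, value[y]) for y in preds if level[y] < level[x]}
    # Σ(x): antecedents at the same level are expanded recursively.
    for y in preds:
        if level[y] == level[x]:
            result |= conflicting_assignment(y)
    return result


def conflict_induced_clause(ac: Set[Tuple[int, int]]) -> List[int]:
    """Equation (4): x = 1 contributes ¬x, x = 0 contributes x."""
    return sorted((-x if v == 1 else x for x, v in ac), key=abs)


ac = conflicting_assignment(KAPPA)
clause = conflict_induced_clause(ac)
```

Under these assumptions, ac contains x1 = 1, x9 = 0, x10 = 0 and x11 = 0, and
clause is [-1, 9, 10, 11], that is, (¬x1 + x9 + x10 + x11).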

3.1 Standard Conflict Analysis Engine 

The identification of a conflict-induced clause ωc(κ) enables the derivation
of further implications that help prune the search. Immediate implications of
ωc(κ) include asserting the current decision variable to its opposite value and
determining a backtracking level for the search process. Such immediate
implications do not require that ωc(κ) be added to the clause database.
Augmenting the clause database with ωc(κ), however, has the potential of
identifying future implications that are not derivable without ωc(κ). In
particular, adding ωc(κ) to the clause database ensures that the search engine
will not regenerate the conflicting assignment that led to the current
conflict.

3.1.1 Failure-Driven Assertions. If ωc(κ) involves the current
decision variable, erasing the implication sequence at the current decision level
makes ωc(κ) a unit clause and causes the immediate implication of the decision
variable to its opposite value. We refer to such assignments as failure-driven
assertions (FDAs) to emphasize that they are implications of conflicts and not
decision assignments. We note further that their derivation is automatically
handled by our BCP-based deduction engine and does not require special
processing. This is in contrast with most search-based SAT algorithms that
treat a second branch at the current decision level as another decision
assignment. Using our running example (see Figure 1) as an illustration, we
note that after erasing the conflicting implication sequence at level 6, the
conflict-induced clause ωc(κ) in (5) becomes a unit clause with ¬x1 as its free
literal. This immediately implies the assignment x1 = 0 and x1 is said to be
asserted.
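
A small sketch shows why no special processing is needed for a failure-driven
assertion: once the assignments of the current level are erased, the recorded
conflict-induced clause is simply a unit clause. The data below follow the
running example (decision x1 = 1@6, earlier assignments x9 = 0@1, x10 = 0@3,
x11 = 0@3); the function names and the integer literal encoding are assumptions
of this illustration.

```python
from typing import Dict, List, Optional, Tuple

Clause = List[int]           # assumed encoding: +i is x_i, -i is ¬x_i
Assignment = Dict[int, int]  # variable index -> 0 or 1


def erase_level(assignment: Assignment, level: Dict[int, int],
                d: int) -> Assignment:
    """Delete every assignment made at decision level d."""
    return {x: v for x, v in assignment.items() if level[x] != d}


def unit_implication(clause: Clause,
                     assignment: Assignment) -> Optional[Tuple[int, int]]:
    """Return the (variable, value) implied by a unit clause, else None."""
    free = []
    for lit in clause:
        value = assignment.get(abs(lit))
        if value is None:
            free.append(lit)
        elif value == (1 if lit > 0 else 0):
            return None                 # clause already satisfied
    if len(free) != 1:
        return None
    lit = free[0]
    return abs(lit), (1 if lit > 0 else 0)


assignment = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 9: 0, 10: 0, 11: 0}
level = {1: 6, 2: 6, 3: 6, 4: 6, 5: 6, 6: 6, 9: 1, 10: 3, 11: 3}
conflict_clause = [-1, 9, 10, 11]       # the clause of equation (5)

erased = erase_level(assignment, level, 6)
asserted = unit_implication(conflict_clause, erased)
```

Here asserted evaluates to (1, 0), i.e. the assertion x1 = 0 described in the
text.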

3.1.2 Conflict-Directed Backtracking. If all the literals in ωc(κ)
correspond to variables that were assigned at decision levels that are lower than
the current decision level, we can immediately conclude that the search process
needs to backtrack. This situation can only take place when the conflict in
question is produced as a direct consequence of diagnosing a previous conflict
and is illustrated in Figure 3 (a) for our working example. The implication
sequence generated after asserting x1 = 0 due to conflict κ leads to another
conflict κ'. The conflicting assignment and conflict-induced clause associated
with this new conflict are easily determined to be

$$A_c(\kappa') = \{ x_9 = 0,\; x_{10} = 0,\; x_{11} = 0,\; x_{12} = 1,\; x_{13} = 1 \}$$
$$\omega_c(\kappa') = (x_9 + x_{10} + x_{11} + \neg x_{12} + \neg x_{13}) \qquad (6)$$

and clearly show that the assignments that led to this second conflict were all
made prior to the current decision level.

Figure 3. Non-chronological backtracking: (a) conflicting implication
sequence; (b) decision tree.

In such cases, it is easy to show that no satisfying assignments can be found
until the search process backtracks to the highest decision level at which
assignments in Ac(κ') were made. Denoting this backtrack level by β, it is
simply calculated according to:

$$\beta = \max \{\, \delta(x) \mid (x, v(x)) \in A_c(\kappa') \,\} \qquad (7)$$

When β = d - 1, where d is the current decision level, the search process
backtracks chronologically to the immediately preceding decision level. When
β < d - 1, however, the search process may backtrack non-chronologically by
jumping back over several levels in the decision tree. It is worth noting that
all truth assignments that are made after decision level β will force the
just-identified conflict-induced clause ωc(κ') to be unsatisfied. A search
engine that backtracks chronologically may, thus, waste a significant amount of
time exploring a useless region of the search space only to discover after much
effort that the region does not contain any satisfying assignments. In
contrast, the GRASP search engine jumps directly from the current decision
level back to decision level β. At that point, ωc(κ') is used to either derive
an FDA at decision level β or to calculate a new backtracking decision level.

For our example, after occurrence of the second conflict the backtrack decision
level is calculated, from (7), to be 3. Backtracking to decision level 3, the
deduction engine creates a conflict vertex corresponding to ωc(κ'). Diagnosis
of this conflict leads to an FDA of the decision variable at level 3 (see
Figure 3 (b)).
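
The backtrack level of (7) is a one-line computation once the conflicting
assignment and the decision levels are known, as the sketch below shows for the
second conflict of the running example. The decision level of x13 is not
legible in our copy of the example, so the value 2 used here is an assumption;
any level not exceeding 3 gives the same result. Function names are
hypothetical.

```python
from typing import Dict, Set, Tuple


def backtrack_level(ac: Set[Tuple[int, int]], level: Dict[int, int]) -> int:
    """Equation (7): highest decision level among the conflicting assignments."""
    return max(level[x] for x, _ in ac)


def is_non_chronological(beta: int, d: int) -> bool:
    """True when the search jumps back over more than one decision level."""
    return beta < d - 1


# Conflicting assignment of the second conflict, as (variable, value) pairs.
ac_second = {(9, 0), (10, 0), (11, 0), (12, 1), (13, 1)}
# Decision levels; the entry for x13 is an assumed value (see lead-in).
level = {9: 1, 10: 3, 11: 3, 12: 2, 13: 2}

beta = backtrack_level(ac_second, level)
jump = is_non_chronological(beta, d=6)
```

With the current decision level d = 6, beta is 3 and the backtrack is
non-chronological, in agreement with the text.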

The pseudo-code illustrating the main features of the diagnosis engine in 
GRASP is shown in Figure 2. General proofs of the soundness and completeness 
of GRASP can be found in [7, 14]. 

3.2 Variations on the Standard Diagnosis Engine 

The standard conflict diagnosis, described in the previous section, suffers 
from two drawbacks. First, conflict analysis introduces significant overhead 
which, for some instances of SAT, can lead to large run times. Second, the size 
of the clause database grows with the number of backtracks; in the worst case 
such growth can be exponential in the number of variables. 

The first drawback is inherent to the algorithmic framework we propose. For- 
tunately, the experimental results presented in Section 4 clearly suggest that, for 
specific instances of SAT, the performance gains far outweigh the procedure’s 
additional overhead. 

One solution to the second drawback is a simple modification to the conflict 
diagnosis engine that guarantees the worst case growth of the clause database to 
be polynomial in the number of variables. The main idea is to be selective in 
the choice of clauses to add to the clause database. Assume that we are given 
an integer parameter k. Conflict-induced clauses whose size (number of literals) 
is no greater than k are marked green and handled as described earlier by the 
standard diagnosis engine. Conflict-induced clauses of size greater than k are 
marked red and kept around only while they are unit clauses. Implementation of 
this scheme requires a simple modification to procedure Erase(), which must 
now delete red clauses with more than one free literal, and to the diagnosis 
engine, which must attach a color tag to each conflict-induced clause. With 
this modification the worst case growth becomes polynomial in the number of 
variables as a function of the fixed integer k. 
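
The green/red scheme can be sketched as a small clause store. The class and
method names are hypothetical, the literal encoding (+i for x_i, -i for ¬x_i)
is an assumption, and the erase method is assumed to be called with the
assignment that remains after the current decision level has been undone.

```python
from typing import Dict, List

Clause = List[int]           # assumed encoding: +i is x_i, -i is ¬x_i
Assignment = Dict[int, int]  # variable index -> 0 or 1


class ClauseDatabase:
    """Clause store that keeps large conflict-induced clauses only while unit."""

    def __init__(self, clauses: List[Clause], k: int) -> None:
        self.original = [list(c) for c in clauses]
        self.green: List[Clause] = []   # conflict clauses with at most k literals
        self.red: List[Clause] = []     # conflict clauses with more than k literals
        self.k = k

    def add_conflict_clause(self, clause: Clause) -> None:
        """Tag a new conflict-induced clause by size and record it."""
        target = self.green if len(clause) <= self.k else self.red
        target.append(list(clause))

    def erase(self, assignment: Assignment) -> None:
        """Drop red clauses that have more than one free literal."""
        self.red = [c for c in self.red
                    if sum(1 for lit in c if abs(lit) not in assignment) <= 1]

    def all_clauses(self) -> List[Clause]:
        return self.original + self.green + self.red
```

With k = 3, the four-literal clause (¬x1 + x9 + x10 + x11) is tagged red. It
survives an erase that leaves x9, x10 and x11 assigned, since only ¬x1 is free,
and is deleted by a later erase that leaves more than one of its literals free.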

Further enhancements to the conflict diagnosis engine involve generating
stronger implicates (containing fewer literals) by more careful analysis of the
structure of the implication graph. Such implicates are associated with the
dominators [15] of the conflict vertex κ. These dominators, referred to as
unique implication points (UIPs), can be identified in linear time with a
single traversal of the implication graph. Additional details of the above
improvements to the standard diagnosis engine can be found in [15].

4. Experimental Results 

In this section we present an experimental comparison of GRASP with
two state-of-the-art and publicly available SAT programs, TEGUS [17] and
POSIT [5]. TEGUS was adapted to read CNF formulas and augmented to continue
searching when all its default options were exhausted in order to abort
fewer faults. No changes were made to POSIT.

GRASP and POSIT have been implemented in C++, whereas TEGUS has 
been implemented in C. The programs were compiled with GCC 2.7.2 and run 
on a SUN SPARC 5/85 machine with 64 MByte of RAM. The experimental 
evaluation of the three programs is based on two different sets of benchmarks: 

■ The UCSC benchmarks [4], developed at the University of California, 
Santa Cruz, that include instances of SAT commonly encountered in test 
pattern generation of combinational circuits for bridging and stuck-at faults, 

■ The DIMACS challenge benchmarks [4], that include instances of SAT 
from several authors and from different application areas. 

For the experimental results given below, GRASP was configured to use the 
decision engine described in Section 2.5, to allow the generation of clauses 
based on UIPs, and to limit the size of clauses added to the clause database 
to 20 or fewer literals. All SAT programs were run with a CPU time limit of 
10,000 seconds (about three hours). 

For the tables of results the following definitions apply. A benchmark suite 
is partitioned into classes of related benchmarks. In each class, #M denotes the 
total number of class members; #S denotes the number of class members for 
which the program terminated in less than the allowed 10,000 CPU seconds; 
and Time denotes the total CPU time, in seconds, taken to process all members 
of the class. 

The results obtained for the UCSC benchmarks are shown in Table 1. The 
BF and SSA benchmark classes denote, respectively, CNF formulas for bridging 
and stuck-at faults. For these benchmarks GRASP performs significantly better 
than the other programs. Both POSIT and TEGUS abort a large number of 
problem instances and require much larger CPU times. These benchmarks are 
characterized by extremely sparse CNF formulas for which BCP-based conflict 
analysis works particularly well. The performance difference between GRASP 
and TEGUS, a very efficient ATPG tool, clearly illustrates the power of the 
search-pruning techniques included in GRASP. 

An experimental study of the effect of the growth of the clause database on
the amount of search and the CPU time can be found in [15]. In general, adding
larger clauses helps reduce the number of backtracks and the CPU time. This
holds true until the overhead introduced by the additional clauses offsets the
gains of reducing the amount of search.

GRASP was also compared with the other algorithms on the DIMACS benchmarks
[4], and the results are included in Table 1. We can conclude that for
classes of benchmarks where GRASP performs better, the other programs either
take a very long time to find a solution or are unable to find a solution in less
than 10,000 seconds. We can also observe that benchmarks on which POSIT
performs better than GRASP can also be handled by GRASP; only the overhead
inherent to GRASP becomes apparent.

Benchmark         GRASP           TEGUS           POSIT
Class       #M    #S      Time    #S      Time    #S      Time
BF-0432     21    21      47.6    19    53,852    21      55.8
BF-1355    149   149     125.7    53   993,915    64   946,127
BF-2670     53    53      68.3    25   295,410    53     2,971
SSA-0432     7     7       1.1     7     1,593     7       0.2
SSA-2670    12    12      51.5     0   120,000    12     2,826
SSA-6288     3     3       0.2     3      17.5     3       0.0
SSA-7552    80    80      19.8    80     3,406    80      60.0
AIM-100     24    24       1.8    24     107.9    24     1,290
AIM-200     24    24      10.8    23    14,059    13   117,991
BF           4     4       7.2     2    26,654     2    20,037
DUBOIS      13    13      34.4     5    90,333     7    77,189
II-32       17    17       7.0    17     1,231    17     650.1
PRET         8     8      18.2     4    42,579     4    40,691
SSA          8     8       6.5     6    20,230     8      85.3
AIM-50      24    24       0.4    24       2.2    24       0.4
II-8        14    14      23.4    14      11.8    14       2.3
JNH         50    50      21.3    50     6,055    50       0.8
PAR-8       10    10       0.4    10       1.5    10       0.1
PAR-16      10    10     9,844    10     9,983    10      72.1
II-16       10     9    10,311    10     269.6     9    10,120
H            7     5    27,184     4    32,942     6    11,540
F            3     0    30,000     0    30,000     0    30,000
G            4     0    40,000     0    40,000     0    40,000
PAR-32      10     0   100,000     0   100,000     0   100,000

Table 1. Results on the UCSC and DIMACS benchmarks.

Another useful experiment is to measure how well conflict analysis works
in practice. For this purpose statistics regarding some DIMACS benchmarks
are shown in Table 2, where #B denotes the number of backtracks, #NCB denotes
the number of non-chronological backtracks, #LJ is the size of the largest
non-chronological backtrack, #UIP indicates the number of unique implication
points found, %G denotes the variation in size of the clause database, and Time
is the CPU time in seconds. From these examples several conclusions can be
drawn. First, the number of non-chronological backtracks can be a significant
percentage of the total number of backtracks. Second, the jumps in the decision
tree can save a large amount of search work. As can be observed, in some cases
the jumps taken potentially save searching millions of nodes in the decision tree.
Third, the growth of the clause database is not necessarily large. Fourth, UIPs do
occur in practice and for some benchmarks a reasonable number is found given
the number of backtracks. Finally, for most of these examples conflict analysis
causes GRASP to be much more efficient than POSIT and TEGUS. Nevertheless,
either POSIT or TEGUS can be more efficient in specific benchmarks, as
the examples of the last three rows of Table 2 indicate. TEGUS performs
particularly well on these instances because they are satisfiable and because
TEGUS iterates several decision making procedures.

Benchmark         #B  #NCB  #LJ  #UIP    %G  GRASP Time  TEGUS Time  POSIT Time
aim.200.2.y2     109    50   13    25   153        0.38        2.80       7,991
aim.200.2.y3      74    35   16    15   100        0.31        0.64     >10,000
aim.200.2.n1      29    20   12     5    23        0.13       69.93     >10,000
aim.200.2.n2      39    20   37     4    44        0.19       87.53     >10,000
bf0432-007       335   124   17    32    48        5.18       6,649       11.79
bf1355-075        40    20   24     2     7        1.25        4.83     >10,000
bf1355-638        11     7    8     4     1        0.32     >10,000     >10,000
bf2670-001        16     8   22     2     3        0.40     >10,000       25.64
dubois30         233    72   16    21   466        0.68     >10,000     >10,000
dubois50         485   175   26    51   632        2.80     >10,000     >10,000
dubois100       1438   639   67   150  1034       26.22     >10,000     >10,000
pret60_40        147    98   17     8   407        0.41      652.30      175.49
pret60_60        131    83   16    10   354        0.35      639.27      173.12
pret150_25       428   313   38    35   588        4.84     >10,000     >10,000
pret150_75       388   257   49    20   447        3.85     >10,000     >10,000
ssa0432-003       37     6    5     1    31        0.15      221.71        0.01
ssa2670-130      130    45   34    10    17        2.07     >10,000       14.23
ssa2670-141      377    97   16    28    66        3.42     >10,000       70.82
ii16a1           110    19   13     0     0       13.61        5.99     >10,000
ii16b2          2664   120    9    39    64      175.85        6.94       16.38
ii16b1         88325  2588   41   624   132     >10,000       21.65       16.73

Table 2. Statistics of running GRASP on selected benchmarks.

5. Conclusions and Research Directions 

This paper introduces a procedure for conflict analysis in satisfiability
algorithms and describes a configurable algorithmic framework for solving SAT.
Experimental results indicate that conflict analysis and its by-products,
non-chronological backtracking and identification of equivalent conflicting
conditions, can contribute decisively to efficiently solving a large number of
classes of instances of SAT. For this purpose, the proposed SAT algorithm is
compared with other state-of-the-art algorithms.

The natural evolution of this research work is to apply GRASP to different
EDA applications, in particular test pattern generation, timing analysis, delay
fault testing and equivalence checking, among others. Despite being a fast SAT
algorithm, GRASP introduces noticeable overhead that can become a liability
for some of these applications. Consequently, besides the algorithmic
organization of GRASP, special attention must be paid to the implementation
details. One envisioned compromise is to use GRASP as the second-choice SAT
algorithm for the hard instances of SAT, whenever simpler algorithms with less
overhead fail to find a solution in a small amount of CPU time.

Future research work will emphasize heuristic control of the rate of growth of
the clause database. Another area for improving GRASP is related to the
deduction engine. Improvements to the BCP-based deduction engine are described
in [14] and consist of different forms of probing the CNF formula for creating
new clauses. This approach naturally adapts and extends other deduction
procedures, e.g. recursive learning [9] and transitive closure [2].

Abstract

At the inception of ICCAD in 1983, system-level design was only a small fish in the EDA pond. 
In the earlier conferences, only one or at most two sessions were dedicated to the topic. This
has changed dramatically over the years, and today system design is one of the pillars of the 
conference. In this paper, we describe the major trends in the field as can be traced from the 
papers published in the conference as well as other seminal publications. While doing so, we put 
the papers selected for this volume in the context of the ongoing trends at the time of publication, 
and their impact on the field. In addition, we provide some background that may help to determine 
why some proposed approaches did or did not succeed in the long term. We conclude the paper 
with some reflections on the past and the future. 



1. Some Reflections 

When browsing through the papers, one cannot escape the realization that the 
concept of “system” has evolved substantially over the twenty year history of 
the conference. One can conclude upfront that there are perhaps as many defi- 
nitions of an electronic system as there are engineers designing them. A famous 
(or infamous) quote states that the best definition of a system is “the design 
abstraction just above the level a designer is currently working on.” Although 
ICCAD does not stand for CAD for IC’s, it is clear that its concept of what a 
system is coincides roughly with what can be integrated on a single digital chip 
over time. Moore’s law has allowed us to integrate ever more parts of complete 
systems on a single piece of silicon. Hence many people today use the term 
“System-On-a-Chip”, although in reality no chip contains a complete system in 
itself. We must keep in mind that complete systems, such as telecom, automotive 
and broadcasting services, have a much broader scope, in which the digital SoC 
is only a subsystem embedded in transducer networks and complex analog and RF 
subsystems with which it interacts very closely. Last but not least, a complete 
system view refers to the interaction of this heterogeneous electronic system 
with the non-electronic environment. 
Hence it is true that SoC design is rapidly evolving from simple component 
design to design of complete complex subsystems on a chip and that the tradi- 
tional chip-architects must become more and more familiar with the challenging 
application domains of the future. How ICCAD will cope with that in the fu- 
ture is an interesting question but a first step in that direction can be found in 
the embedded tutorials on application domains such as mobile and broadband 
communication, multimedia, microsystems and the like ([1-5]). 

With this constraint in mind one has to recognize that “system design” in the 
ICCAD sense so far has been restricted to the digital domain and mainly refers 
to design technology above the logic and RT-levels, whereby the goal is to map 
or refine a behavioural description directly into micro-architectures that, in turn, 
can be transformed into the layout or memory content of digital processors on a 
chip. 

In this paper, we sketch the evolution of the system-level design domain as 
observed from twenty years of ICCAD. We conclude the paper with some 
reflections. 



2. The Evolution of System-Level Design in 20 Years of ICCAD 

2.1 The Pre-ICCAD Period 

Mapping behaviour into silicon is a two-step process. First, it 
requires a behavioural or high-level synthesis step translating a specification into 
a micro-architecture. In a second step, the micro-architectural structure must be 
mapped into a layout. A specification consists of a behavioural description in a 
high-level language together with a set of area, time and power constraints. Ba- 
sically this requires three ingredients: a language, a target architectural style and 
a method to synthesize layout from the structural description of the architecture. 
In all three areas some work took place before the first ICCAD in 1983. Work 
on micro-architectural synthesis of dedicated data-paths executing instructions 
generated from a finite state machine (later called FSMD) started in the late 
seventies. Pioneering work was done by G. Zimmerman at Kiel University [6], 
resulting in the MIMOLA system, and by D. Thomas’ group at CMU, leading to 
the System Architect Workbench [7]. These efforts, whose goal was to generate 
processor architectures from the instruction set, laid the foundations for 
high-level synthesis. The internal representation by a control- and data-flow graph 
(CDFG), source level optimization, operation scheduling, resource allocation, 
operation-to-operator assignment, and controller synthesis were clearly identi- 
fied and the first tools were demonstrated. 

Perhaps one of the biggest bottlenecks for the success of high-level synthe- 
sis then and still today was the lack of a proper description language. The first 
attempts to create behavioural languages allowing for bit-true data-types date 
back to the pioneering work of Piloty on CONLAN [8] and Barbacci on ISPS 
[9]. Up to 1987 there was no standardised HDL like VHDL or Verilog 
available and, as a result, high-level synthesis tools and systems were 
developed around many different custom languages that allowed easy compiling 
into the internal Control-Data Flow Graph (CDFG) representation but were 
foreign to the designer community. 

As far as implementation into silicon is concerned, it is interesting to see that 
high-level mapping into register transfer architectures historically preceded the 
explosion of logic synthesis that took place from the late seventies till the mid- 
eighties. In the late seventies excellent heuristic algorithms for two-level logic 
optimisation were developed that could be mapped rather straightforwardly into 
PLA layout structures for the control part of the architectures [12]. Synthesis of 
much more efficient multi-level logic and its automated implementation in stan- 
dard cell layout became practical only after the invention of algebraic methods 
for logic synthesis and their implementation in a comprehensive CAD system 
such as MIS, presented in 1986 by Brayton et al [10, 11]. It took even longer 
before logic synthesis was capable of handling the regular bit-vector structure 
of data-path operators. 

This explains the early success of the so-called procedural layout generators 
or module generators that mapped the repetitive logic structure of data-paths 
into an abutment of the data-path leaf cells. This technique dates back to the 
landmark paper in 1979 by Johansen of Caltech who launched the term Silicon 
Compilation [13]. 

2.2 The Silicon Compilation Era 

Walking through the early ICCAD proceedings, one can see that papers in 
the category “Systems” in the 1983-1986 era were predominantly addressing 
Silicon Compilers in the Johansen sense. Procedural module compilers became 
popular for the layout generation of data-paths in microprocessors and DSP pro- 
cessors. For example, in one of the first papers presented at ICCAD, Buric et 
al. [14] describe how to impose timing and electrical constraints in data path 
compilers. It took until 1987 before this approach was supplanted by logic 
synthesis tools and technology mapping into standard cells. Today, module 
generation remains in vogue only for highly regular parameterised structures such 
as embedded memory. This might change in the future, when optical proximity 
effects in lithography, process variations and signal-integrity considerations 
may again call for a high degree of regularity, even for the implementation of logic 
functions. 

2.3 Architectural or Behavioural Synthesis 

Although activities in high-level synthesis (HLS) were reported extensively 
at the Design Automation Conference from 1979 on, it would take until 1985 for 
the first session on high-level synthesis to appear at ICCAD. From then on, the 
synthesis of hardware architectures remained a topic of interest for many years, 
with many excellent contributions on the classical aspects of scheduling, alloca- 
tion and assignment, the three basic steps of all high-level synthesis systems. In 
retrospect, it is important to see that, over time, many researchers came to the 
conclusion that, although the basic steps are always there, their sequence and 
nature depends very much on the so-called target architecture to be synthesised, 
and that in turn this target architecture is a strong function of the application 
domain. We believe that this is one of the basic reasons (besides the specifica- 
tion language) that high-level synthesis, in spite of its great potential, has not 
yet reached (and may never reach) the commercial success of 
register-transfer level (RTL) synthesis, now in general use in the industry. 

The original target domains for HLS were processor data-paths (synthesized 
from an instruction set), and real-time digital signal processors. A pioneering 
example of the former application-type, which is control dominated, was the 
Yorktown Heights Silicon Compiler of IBM [15]. Yet, it was in the digital sig- 
nal processing (DSP) arena that HLS made its first real inroads and industrial 
impact. DSP is data-dominated and transformational in nature whereby repeti- 
tive processing on streams of multi-dimensional arrays is dominant. A hallmark 
example of a synthesis environment in this class was the CATHEDRAL-II com- 
piler from IMEC [16]. 

Notice that both compilers need very different scheduling techniques. Cam- 
posano introduced path-based scheduling in [17]. Care is taken that an opti- 
mum schedule length is obtained for every possible path in the CDFG, which 
is predominantly a CFG. A considerable improvement on this technique was 
later introduced in [18]. In the area of data-centric scheduling, a pioneering ap- 
proach was introduced by Goossens et al. (included in this volume at page 107) 
[19]. Here the main focus is the scheduling of repetitive data-flow specifications 
containing nested loops and multi-dimensional signal arrays. This list-based 
scheduling for DSP and its later adaptations have been successfully implemented 
in commercial high-level synthesis systems. 

The architectural style for the first 
HLS systems was basically a multi-operator datapath with local register files and 
a rich instruction set. Register files are interconnected by a custom-designed 
network, and the instructions are generated by a (primitive) PLA based fixed 
instruction sequencer. Today we would call this a VLIW architecture with a 
fixed microcoded controller. These architectures could successfully handle DSP 
algorithms in the audio domain, given the CMOS technology available at that 
time, but were not powerful enough to meet the speed requirements of video, 
graphics and/or radar applications. Highly pipelined and specialised data-paths 
with specialized instruction sets were needed for those tasks. It is remarkable 
that suddenly in 1989 a number of pioneering contributions in this domain were 
made. Two schools of thought emerged: Park and Kurdahi [20], and Hwang 
et al. [21] proposed to perform pipelined scheduling first, followed by 
allocation. Note et al. [22], on the other hand, use the structure of nested-loop code 
to assemble single-cycle programmable data-path operators by clustering sim- 
ilarly structured CDFG sub-graphs into dedicated data path structures, and use 
the pipelined scheduling techniques afterwards. The latter method formed the 
basis for several high throughput compilers for video and high throughput sys- 
tems such as the PHIDEO compiler from Philips Semiconductor [24]. Most 
of these compilers made use of the innovative force-directed scheduling tech- 
nique proposed by Paulin in 1987 [23], which proved to be very well suited for 
time-critical pipelined scheduling. A considerable extension to this end (and its 
application in PHIDEO) was proposed in 1992 by Verhaegh et al. [24]. Most 
scheduling problems in system level design are subject to strong external timing 
and synchronisation constraints. After 1994, this led to increased attention to a 
new approach using automata theory and symbolic techniques based on ROBDDs. 
Pioneering work in this area was reported in [62] and [63]. It also became 
apparent that in multi-media architectures, the memory 
structures used to store the multidimensional signals often tend to dominate 
area, timing and power dissipation. This was first observed in two landmark 
papers at ICCAD-91 ([25,26]), and has led to a waterfall of contributions in 
the subsequent years (for instance, [27,28]). The 1994 contribution of Kolson 
et al. [29] is one of the first papers to recognize the importance of code trans- 
formations to reduce memory size (area) and data-transfers (power). Since then 
numerous contributions have been made that led to dramatic performance 
improvement and power reduction in memory architectures. An in-depth overview 
of this impact was presented in a tutorial paper in 2000 [30]. While optimization 
of the memory itself is important, one should not ignore the address generation 
process. Research in the area of the automatic synthesis of address generators 
was triggered by a seminal paper by Grant and Denyer in 1988 [31] and was 
followed by others in subsequent years. 

Similar to developments in optimizing compilers, researchers in the HLS 
world quickly realized that optimizing transformations are an essential com- 
ponent of any synthesis environment. In effect, the potential impact of trans- 
formations is even greater here, because of the broader design space and the 
wider variety of cost functions. Initial transformation and design space explo- 
ration systems (such as HYPER [32]) focused primarily on the minimization of 
traditional cost functions such as speed and area in data-path-like architectures. 
As mentioned in the preceding paragraph, this was later followed by transfor- 
mational tools targeting the memory architecture (e.g. [30]). In the early 90s, 
power minimization became a topic of the foremost importance. Pioneering 
research in low-power design techniques was performed at UC Berkeley in the 
InfoPad project. One of the major findings was that the biggest gains in power 
efficiency are obtained by appropriate transformations at the algorithmic level, 
combined with optimal supply voltage selection. This knowledge was implemented 
in the HYPER synthesis tool by Chandrakasan et al. [33]. The paper, included in 
this volume at page 117, draws attention to the fact that, for the first time, the 
supply voltage becomes an additional dimension in the design exploration 
space. This has had great impact on daily low-power design practice. 
This pioneering paper inspired a chain of refinements (such as [34] and [35], 
just to mention a few). 
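
The interplay between algorithmic transformations and supply voltage can be illustrated with the first-order dynamic power model P = C_eff * Vdd^2 * f. The sketch below is not taken from HYPER or from [33]; the function name and all capacitance, voltage and frequency figures are hypothetical, normalised values, chosen only to show how a transformation that adds hardware can still reduce power once the supply voltage is lowered.

```python
def dynamic_power(c_eff, vdd, freq):
    """First-order CMOS dynamic power: P = C_eff * Vdd^2 * f."""
    return c_eff * vdd ** 2 * freq


# Hypothetical reference data path (normalised units).
reference = dynamic_power(c_eff=1.0, vdd=5.0, freq=1.0)

# Hypothetical two-way parallel version of the same algorithm: roughly
# twice the switched capacitance, each unit clocked at half the rate.
# The relaxed timing is assumed to allow a lower supply voltage.
parallel = dynamic_power(c_eff=2.2, vdd=3.0, freq=0.5)

ratio = parallel / reference
print(f"power ratio parallel/reference: {ratio:.2f}")
```

Under these assumed numbers the throughput is unchanged (two units at half the clock rate), while the quadratic dependence on the supply voltage more than compensates for the extra capacitance.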

In summary, it is fair to state that ICCAD contributed considerably to the 
basic design knowledge, methodology, and tools for the synthesis of fixed hard- 
ware co-processor architectures. 

2.4 The Advent of Hardware-Software Co-Design 

In the beginning of the nineties it became clear that programmable 
cores, combined with hardware coprocessors communicating over customized 
bus networks, had become the preferred target architecture for embedded 
applications, rather than the hardwired special-purpose processor offered by HLS. In 
spite of an area, power and performance overhead, programmable components 
allow for a high degree of reuse, offer a higher system design productivity, allow 
for late-binding of function to component reducing the design risk, and enable 
upgrading in the field. Hence a new dimension in the design space was added: 
software embedded in a hardware architecture, quickly dubbed as hardware- 
software codesign (HSC). 

This trend starts in 1992 with a tutorial and three papers in session 10. In the 
beginning most work concentrates on single-processor, single-accelerator 
coprocessor architectures. After 1999, due to further integration capability, the 
attention shifts to multi-processor systems and performance estimation. This 
leads to a waterfall of seminal papers such as [55-57]. Starting in 1995, the 
topic of dynamic power management of embedded systems also appears, and in the 
years 1998-2002 dynamic voltage and frequency scaling become the most popular 
techniques to control power dissipation in systems dominated by dynamic software 
tasks. A good overview of work on dynamic power management can be found 
in [58, 59] by Benini and De Micheli, while [60] by Shina and Chandrakasan is 
representative of the state of the art in dynamic voltage scaling. 

Work in the multiprocessor area then leads to the concept of platform archi- 
tectures, discussed in depth in a tutorial at ICCAD 2001. Since software is very 
energy consuming there is an increasing attention to characterizing and opti- 
mizing software power dissipation, a topic totally neglected by the traditional 
computer science discipline. The term Hardware dependent Software (HdS) is 
launched. The main (initial) research issues in the HSC space are: 

■ Co-simulation of hardware and software 

■ Optimal hardware-software partitioning 

■ Interface synthesis 

■ Power and performance estimation and optimization of embedded soft- 
ware implementations 

At ICCAD little attention was paid to the first topic and, although some pa- 
pers at the conference still address the partitioning issue, today’s HSC practice 
indicates that such partitioning is either clear from an application’s standpoint or 
can best be found using an interactive system that allows the user to explore the 
design space either using high-level estimations [61] or by repeated synthesis as 
an estimation tool. In contrast, pioneering and influential work was presented at 
the conference in the latter two areas. 

When putting multiple components of a heterogeneous nature together on a 
die, ensuring correct communication between them is of primary importance. 
Rather than relying on predefined interfaces, automatic synthesis of the inter- 
face is a far more effective approach. Groundbreaking work in this area was 
performed by Borriello and his co-workers. Their three papers at ICCAD on 
this topic truly form the foundation of all the efforts in this space ([36-38]). A 
wide range of papers have been published in subsequent years (for example, 
[39], [40]), some of which were the basis of EDA start-ups such as CoWare. 
When put in the context of platform-based design, we see that interface synthe- 
sis evolves into the concept of communication-based design, which is perhaps 
one of the key disciplines to improve design productivity at the “system” level. 

An important component of design-space exploration in the HSC space is 
the capability to quickly predict the performance and power dissipation of the 
software components. Creating easy-to-produce and reliable models is one of 
the main challenges here. One of the landmark papers in this area, included in 
this volume at page 129, is the 1994 contribution of Tiwari, Malik and Wolfe 
[41]. It addresses the power modelling of software and is based on a set of care- 
fully chosen power consumption measurements of the individual instructions of 
a core processor. The method also includes the interaction between different in- 
structions and estimates power losses due to cache misses. While pragmatic, the 
approach represents great progress with respect to a complete simulation of the 
processor at the gate and register level, and provides a very effective 
methodology for power modelling. It has since been used substantially in academia 
and industry alike. 
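
The structure of such an instruction-level model can be sketched as follows. This is only an illustration of the three ingredients named above (per-instruction base costs, inter-instruction overheads and cache-miss penalties); the function name, the instruction names and all numbers are hypothetical and are not measurements from [41].

```python
def program_energy(trace, base_cost, pair_overhead, miss_penalty, cache_misses):
    """Sum base costs, inter-instruction overheads and cache-miss penalties."""
    # Base cost of every executed instruction.
    energy = sum(base_cost[op] for op in trace)
    # Extra cost for each pair of consecutive instructions, if one is listed.
    for prev, cur in zip(trace, trace[1:]):
        energy += pair_overhead.get((prev, cur), 0.0)
    # Fixed penalty per cache miss.
    energy += miss_penalty * cache_misses
    return energy


# Hypothetical measurement tables (arbitrary energy units).
base_cost = {"add": 1.0, "mul": 3.0, "load": 2.0}
pair_overhead = {("add", "mul"): 0.5, ("mul", "load"): 0.25}

trace = ["load", "add", "mul", "load"]
total = program_energy(trace, base_cost, pair_overhead,
                       miss_penalty=10.0, cache_misses=1)
print(total)
```

The appeal of this style of model is that the tables are filled in once per processor from measurements, after which any instruction trace can be evaluated without simulating the processor itself.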
In 1995 the same research group proposes an ILP method, implemented in 
the CINDERELLA tool, to find a tight bound on the worst-case execution time 
of embedded software, including a direct-mapped instruction cache [42]. This 
work was extended by Ernst et al. in 1997 [43]. Using symbolic simulation 
and formal techniques, this paper further tightens the timing bounds. The result- 
ing precision enables a substantial reduction in performance overhead, while 
leading to provably correct system and/or interface timing in embedded systems. 
Along the same lines, Mooney and De Micheli [44] present synthesis techniques 
that map multi-task graph representations on a HW/SW architecture whereby 
an RTOS scheduler is synthesized such that hard real-time constraints are pre- 
dictably met. In 1998, Li and Wolf [45] introduce the first hardware/software 
co-synthesis approach to optimize the memory hierarchy along with the rest 
of the architecture of a real-time distributed system. As caches and memories 
typically represent a significant fraction of the cost, size, weight, and power 
consumption of an embedded system, this approach can substantially reduce the 
system cost. Authors of the same group later extend the synthesis technique 
to include bus arbitration in single-bus multiprocessor embedded systems [46]. 
In the section on platform-based design, we will see that the current trend is to 
move away from single bus systems to networked multiprocessor architectures. 

But first let us focus on another trend that is clearly visible at the level of the 
programmable processors themselves. 

2.5 Synthesising and Programming Application-Specific Instruction Processors (ASIPs) 

Throughout most of the System-on-a-Chip (SoC) era, designers have chosen 
to integrate IP (intellectual property) versions of traditional discrete processors 
into their systems. In recent years, it became clear that it is possible to obtain 
better performance for lower power dissipation by embracing domain-specific 
processors with greatly enhanced parallelism and specialised instruction sets. 
These processors are usually of the VLIW type. The crucial challenge then 
becomes to create a design methodology and tool set that allows for the 
description or synthesis of an Instruction Set Architecture (ISA) and the automatic 
generation of a retargetable compiler, which produces code of a quality and size 
rivalling a good assembly programmer. Pioneering work in this area goes back 
to the MIMOLA design system originally developed by Zimmerman and Mar- 
wedel [6], the FLEXWARE system of Paulin and Liem, and the CHESS system 
of Goossens et al. An excellent overview of this early work can be found in [47]. 
Several major contributions in this area were presented at ICCAD over the last 
decade. 
In 1994, Huang and Despain [48] presented a method for combining instruction 
set design, microarchitecture design, and instruction-set mapping within a 
single formulation: a simultaneous scheduling/allocation problem with an inte- 
grated instruction formation process. The formulation takes as inputs the ap- 
plication, architecture template, objective function and design constraints, and 
generates as outputs the instruction set, resource allocation (which instantiates 
the architecture template) and assembly code for the application. Choi et al 
[49] introduce a new approach to generate application-specific instructions for 
DSP applications in 1998. The proposed approach supports for the first time 
multi-cycle complex instructions combined with single cycle instructions. A 
complete design system for ASIPs based on the LISA language, developed for 
the description of ISAs, was introduced in 2001 [50]. 

As stated earlier, VLIW architectures are popular for ASIPs as they allow a 
high degree of instruction-level parallelism, reflecting the concurrency that is 
typically present in the applications at hand. In particular, one paper presented 
at ICCAD on this topic deserves special attention. In a paper selected for this 
volume (page 159), Jacome et al. [51] present a design technique that creates 
an optimal VLIW architecture consisting of clusters of customized functional 
units with local register files communicating over an interconnect network. This 
technique allows an exploration of the design space where latency is traded off 
against power and clock speed. Given a kernel, the proposed algorithm explores 
the space of feasible clustered data paths and returns: a data-path configuration; 
a binding and scheduling for the operations; and a corresponding estimate for 
the best achievable latency over the specified family. It is demonstrated that 
clustering functional units around a register file, hence reducing the intercon- 
nect requirements, can have a striking effect on power and performance of the 
processor. It is interesting to observe how this paper extends concepts that were 
popular in the HLS for hardwired data-paths (e.g. [22]) to programmable VLIW 
processors. 



2.6 The Advent of Platforms and Communication-Based Design 

In the mid-nineties, we entered the age of deep-submicron and chips could 
now easily house a number of programmable processors connected to one or 
more buses, coprocessors, hardware accelerators (IP), and reconfigurable fab- 
rics. While this creates enormous opportunities from an application and product 
perspective, it also poses formidable challenges. The design complexity of these 
components is quite daunting and leads to huge NRE costs. Add to that the huge 
mask costs associated with fabrication in 150 nm and below technology nodes. 
All these considerations advocate multiple reuse of the same chip or chip archi- 
tecture for a given application domain. 
The solution for these challenges is a design methodology called platform-based 
design that facilitates design reuse by abstracting hardware to a higher 
level (platforms) that is visible to the application software. The hardware plat- 
form should comprise a family of flexible (parameterizable) architectures that 
adequately support the functions in the application space. Also, a software plat- 
form is needed to abstract the hardware platform into a programmer’s model to 
allow effective mapping. This union of hardware and software platforms is now 
called an architectural platform. Once a system platform has been identified for 
the application space and the architecture space, the final chip design involves 
design exploration within the architecture platform to determine the best map- 
ping of application to architecture. While this approach is steadily emerging as a 
dominant trend in system-level design, it has received minor attention at ICCAD 
so far (besides a tutorial in 2001 [52]). The most prominent presence at the con- 
ference was in the area of communication-based synthesis. Interconnecting a 
large number of complex components on a single die, such that functionality, 
performance and power constraints are met, is one of the main challenges of 
System-on-a-Chip design. More and more, SoCs resemble networked 
multiprocessor architectures. This so-called Network-on-a-Chip approach is just in its 
infancy, but has resulted in complete sessions at recent DAC and ICCAD confer- 
ences. The paper presented by Carloni et al. in 1999 at the conference [53], and 
included in this volume, puts an interesting perspective on how to address the on- 
chip global interconnect. It proposes a synthesis methodology for synchronous 
systems that makes the design functionally insensitive to the latency of long 
wires. Given a synchronous specification of a design, a functionally equivalent 
synchronous implementation that can tolerate arbitrary communication latency 
between designed registers is generated. Using automatically inserted registers, 
a long wire is broken into short segments which can be traversed in a single 
clock cycle. This results in a design that is robust with respect to delays of long 
wires. The method is demonstrated using an out-of-order microprocessor with 
speculative-execution. The important message of the paper is that the intercon- 
nect problem is most adequately addressed in a top-down fashion, interpreting 
the functional constraints and translating them into a network architecture that 
easily meets the circuit-level timing requirements. While currently in the early 
stages of research, we expect platform-based design to dominate the system- 
level design sessions at the conference in the years to come. 
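
The wire segmentation step described above for [53] can be illustrated with a small calculation. The sketch below covers only the counting of inserted registers, under the simplifying assumption that a wire with a given end-to-end delay is cut into equal segments that each fit in one clock period; it ignores register overhead and says nothing about the rest of the methodology. The function name and the delay values are hypothetical.

```python
def registers_to_insert(wire_delay_ps, clock_period_ps):
    """Registers needed so that every wire segment fits in one clock cycle."""
    if wire_delay_ps <= 0 or clock_period_ps <= 0:
        raise ValueError("delays must be positive")
    # Ceiling division: number of single-cycle segments along the wire.
    segments = -(-wire_delay_ps // clock_period_ps)
    # One register sits between each pair of adjacent segments.
    return segments - 1


# Hypothetical wire delays (picoseconds) against a 1000 ps clock.
for delay in (800, 2000, 2300):
    print(delay, registers_to_insert(delay, clock_period_ps=1000))
```

Each inserted register adds one cycle of communication latency, which is why the surrounding design has to be made functionally insensitive to that latency.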

3. Quo Vadis (or Where are We Heading)? 

Compared to most of the other topical areas in design automation, system- 
level design is a relative newcomer, having only a minor presence in the ear- 
lier ICCAD conferences. In addition, the commercial success of most of the 
early efforts in this space has been rather limited (to phrase it nicely). Earlier in 
the paper, we identified the three necessary components for a successful design 
methodology in the system design space: a specification language or environ- 
ment that covers the application space of interest, an effective target architec- 
ture, and the means to translate that architecture into a physical artifact. It is 
our conjecture that these components were not simultaneously present, except 
in some narrow application spaces where it is hard to recuperate the cost of the 
tool development. Looking at the present and the future, it seems that the time 
for effective and widespread system-level design methodology has finally come. 
The recent levels of understanding of the models-of-computation [54-56] enable 
the capture of system-level specifications in a general sense (getting away from 
the stovepipe systems of the past), programmable platforms present a clear and 
well-defined architectural target, and the route from architecture to silicon is 
well-understood (although under some challenging pressures at present). 

What the future will hold is hard to surmise. However, looking down the 
semiconductor roadmap we can identify some important trends. Flexibility and 
reuse (often called programmability) will become ever more important; 
systems-on-chip (or systems-in-a-package) will become ever more heterogeneous, 
combining mixed-signal and hybrid technologies. Power and energy constraints will 
limit the level of integration that can be reached as well as force us to look at 
innovative system architectures, and the reliability and robustness of the integrated 
systems will be severely challenged. Expect these topics to be present in a big 
way in the ICCADs of the future. 

Abstract 

In this paper, a microcode compiler for custom DSP-processors is presented. This tool is part of 
the CATHEDRAL II silicon compiler. Two optimization problems in the microcode compilation 
process are highlighted: microprogram scheduling and memory allocation. Algorithms to solve 
them, partly based on heuristics, are presented. Our compiler successfully handles repetitive pro- 
grams, and is able to decide on hardware binding. In most practical examples, optimal solutions 
are found. Whenever possible, indications of the complexity are given. 



1. Introduction 

The CATHEDRAL II silicon compiler aims at the automatic synthesis of 
synchronous multi-processor ASICs for DSP-applications [3] [4]. It proposes 
a full separation between architecture synthesis and layout generation 
(meet-in-the-middle philosophy). The architecture synthesis tools transform an 
applicative behavioural representation of a DSP-algorithm into a structural 
description of multi-processor data paths [3], and a procedural microcoded 
controller definition. In a first step, a customized data path, composed of 
well-characterized parametrizable modules, is generated by the rule-based synthesis 
program JACK-THE-MAPPER [9]. At the same time, the behavioural repre- 
sentation is translated into an applicative register-transfer (RT) description. In 
a second step, the microcode compiler ATOMICS transforms the applicative 
RT-program into a procedural finite-state machine description. The output of 
this tool is linked with a PLA-based controller compiler, performing state op- 
timization and layout generation, and with an RT-level simulator. In this paper 
we will describe the microcode compiler ATOMICS, with emphasis on efficient 
optimization algorithms for microprogram scheduling (section 2) and memory 
allocation (section 3). 

The applicative RT-input language of ATOMICS is characterized by a highly 
architecture-independent format, and powerful control constructs, e.g. FOR- 
loops and conditional statements. In many compilers, loop constructs are not 
supported since they cause certain optimization problems to become hard. In the 
sequel, RT-programs (not) containing loops will be referred to as (non-) repet- 
itive programs. ATOMICS is based on a parametrizable multiple-branch con- 
troller model with horizontal microcode, allowing for simultaneous data path 
actions and jump address computation [3]. The current version of ATOMICS 
does not perform any transformations that create, delete or alter RTs from the 
input description. 

2. Microprogram scheduling 

In order to capture the time-concept in a repetitive RT-program, the potential 
of an RT is defined as the RT’s machine-cycle number in an imaginary 
implementation, in which the body of every FOR-loop is executed once. The 
goal of microprogram scheduling then is to map the individual RTs to poten- 
tials, thereby exploiting the inherent parallelism in the DSP-algorithm in order 
to minimize the global machine-cycle count. In order to reduce the complex- 
ity of the scheduling problem for repetitive RT-programs, loops are scheduled 
hierarchically, starting at the deepest level of nesting. 

2.1 A constrained optimization problem 

The search space of the optimization is bounded by three categories of con- 
straints : 

1 Forward data-precedences, expressing the need for a minimal (integer)
delay $\delta$ between any pair of RTs A and B, respectively writing and
reading a variable to/from a storage element. This can be denoted :

$p_B(j) \ge p_A(j) + \delta(j)$ (1)

where $p_A(j)$ and $p_B(j)$ represent the potentials of A and B respectively,
in constraint j. Usually, $\delta$ equals 1 cycle. Dependencies between conditional
RTs and the RTs producing the corresponding condition variables are also
modelled in this way. In this case $\delta$ is usually larger than 1, due to the
internal controller pipelining.

2 Resource-allocation constraints (or conflicts), expressing the requirement 
that certain RTs sharing hardware resources (e.g. data path operators, 
memory, busses) cannot be scheduled at the same machine cycle. If C 
and D represent the conflicting RTs in constraint i, the latter can be de- 
noted : 

$p_C(i) \ne p_D(i)$ (2)

3 Looping data-precedences, expressing a precedence relation in a FOR-loop,
between a writing RT in the current loop iteration and a reading RT
in the next iteration. Looping precedence j' between RTs A and B can be
denoted :

$p_B(j') \ge p_A(j') + \delta(j') - (\Delta p + 1)$ (3)

where $\Delta p + 1$ is the total number of potentials spanned by the loop. (3) is
only relevant when $\delta(j') > 1$.

No constraints occur between conditional RTs with disjoint condition fields. 
Two graph representations will be used (e.g. Fig. 1(a)). In both cases, RTs are 
modelled as vertices. Forward data-precedences are modelled as arcs in a di- 
rected weighted graph, from the writing to the reading RT’s vertex, with a weight 
equal to the delay (data-precedence graph). Resource-allocation constraints are 
represented as arcs in an undirected graph (resource-allocation graph).

Figure 1. Scheduling of non-repetitive RT-program : (a) data-precedence graph (solid) and
resource-allocation graph (dotted). Numbers indicate weights of data-precedences; (b) scheduling
based on critical path criterion for conflict resolution.



Scheduling problems are known to be NP-complete. (1), (2) and (3) can 
e.g. be reformulated as an integer linear program (ILP) [3]. Efficient heuris- 
tic scheduling algorithms from the theory of project management [2] have been 
applied to the DSP-domain by Zeman [10], for the subclass of non-repetitive 
RT-programs. We will present an iterative scheduling algorithm, based on [10],
which allows repetitive programs to be scheduled in an automatic and efficient way.
In this paper, we assume that storage elements in the data path may contain
multiple storage-fields (e.g. RAM, ROM, local register-files). Although not described
here, our iterative scheduling technique also allows other memory elements,
such as pipeline latches, to be modelled.

2.2 Scheduling non-repetitive programs

We will review Zeman’s scheduling technique, for non-repetitive programs, 
as an extension of classical levelling of the data-precedence graph. The levels
of the vertices correspond to the potentials assigned to the RTs. A global
search method (levelling), used to take into account the forward data-preceden- 
ces, is combined with local heuristic searches to take into account the NP-hard 
resource-allocation constraints. All potentials are treated consecutively, starting 
from 0. Inspection of the data-precedence graph reveals which RTs are ready 
to be assigned the current potential. These RTs are placed in a waiting list. 
If resource-allocation conflicts occur, a heuristic selection criterion is used to 
derive a conflict-free subset from the waiting list. The non-selected RTs are 
shifted to the waiting list of the next (higher) potential. In ATOMICS, a critical- 
path criterion is used : the scheduling priority of an RT involved in a conflict 
at the current potential, is measured by the length of the critical path in the 
data-precedence graph, emerging from the RT. Modifications of the criterion 
are discussed in [6]. The algorithm is illustrated in Fig. 1(b). Compared to a 
levelling algorithm, only minor extra CPU-time is spent : the complexity of a 
critical path analysis is linear in the number of vertices in the subgraph under 
consideration. 
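
The procedure of this section can be sketched in a few lines of Python. This is only an illustrative sketch and not the ATOMICS implementation: the function names, the tuple encoding of precedences `(A, B, delay)` and conflicts `(C, D)`, and the example data are our own assumptions. It further assumes an acyclic data-precedence graph with delays of at least 1, and it uses the plain critical-path length as the selection criterion.

```python
from collections import defaultdict

def critical_path_lengths(rts, precedences):
    """Longest delay-weighted path leaving each RT in the data-precedence graph."""
    succ = defaultdict(list)
    for a, b, delay in precedences:
        succ[a].append((b, delay))
    memo = {}

    def length(rt):
        if rt not in memo:
            memo[rt] = max((d + length(b) for b, d in succ[rt]), default=0)
        return memo[rt]

    return {rt: length(rt) for rt in rts}

def schedule(rts, precedences, conflicts):
    """Assign a potential to every RT, treating potentials consecutively from 0."""
    preds = defaultdict(list)
    for a, b, delay in precedences:
        preds[b].append((a, delay))
    clash = defaultdict(set)
    for c, d in conflicts:
        clash[c].add(d)
        clash[d].add(c)
    priority = critical_path_lengths(rts, precedences)
    potential = {}
    p = 0
    while len(potential) < len(rts):
        # Waiting list: unscheduled RTs whose forward precedences (1) hold at p.
        waiting = [r for r in rts if r not in potential
                   and all(a in potential and potential[a] + d <= p
                           for a, d in preds[r])]
        chosen = []
        # Longest critical path first; RTs in conflict wait for the next potential.
        for r in sorted(waiting, key=lambda r: -priority[r]):
            if not clash[r] & set(chosen):
                chosen.append(r)
        for r in chosen:
            potential[r] = p
        p += 1
    return potential

# Hypothetical example: A and B share a resource; B heads the longer path.
example = schedule(
    ["A", "B", "C", "D", "E"],
    [("A", "C", 1), ("B", "D", 1), ("D", "E", 1)],
    [("A", "B")],
)
print(example)
```

In this example B is preferred over A at potential 0 because the critical path emerging from B is longer, which yields a schedule of 3 machine cycles instead of 4.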

2.3 Scheduling repetitive programs 

Below, we will extend Zeman’s scheduling algorithm, in order to schedule
repetitive programs. An iterative procedure allows the additional looping
precedences to be taken into account. Convergence is guaranteed, and acceptable
bounds on the number of iterations can be derived.

In the algorithm for non-repetitive RT-programs, the forward precedences (1)
can be interpreted as a specification of a lower bound on the potential $p_B$ of an
RT B which is ready to be scheduled. This bound is a function of the potentials
$p_A(j)$ of all precedent RTs, i.e. all RTs pointing to B via a forward precedence,
and of the associated delays. By definition, B is ready to be scheduled when
all of its precedents A have been scheduled; hence at this point the bound can
be computed explicitly, and equals the potential of the waiting list in which
B will occur for the first time. In a repetitive program, an additional bound
on $p_B$ is provided by the looping precedences (3). In this expression, $\Delta p$ can
be estimated (see below); the potentials $p_A(j')$ are however unknown during
the treatment of the theoretically optimal potential of B. Therefore, the bound
cannot be computed explicitly. In ATOMICS, this problem is solved by calling
Zeman’s algorithm repeatedly in an iteration process : in order to compute the
lower bounds (1) and (3) on $p_B$ during iteration k, the potentials $p_A(j')$ are taken
from the previous iteration k-1, whereas the potentials $p_A(j)$ are taken from
the current iteration, since they are known in time.

According to experiments, the convergence of the iteration process is critically
dependent on the estimate of $\Delta p$ [6] : too small a choice of $\Delta p$ results in
an empty solution space for the optimization, which is manifested as divergence
of the iteration. Divergence is observable from certain vertices in the graph [6].
Too large a choice of $\Delta p$ results in convergence to a sub-optimal solution. A
reliable strategy to determine the optimal $\Delta p$ is to call the previous algorithm
repeatedly in a second iteration loop over $\Delta p$. The optimal $\Delta p$ is determined in
a limited number of steps, via a binary or incremental search between a reliable
lower and upper bound on $\Delta p$. An upper bound on $\Delta p$ is e.g. the sum of all
delays in the precedence graph; this is then also a (pessimistic) bound on the
number of outer-loop iterations.
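
The two nested iterations can be illustrated with a strongly simplified Python sketch. It is not the ATOMICS algorithm: resource-allocation conflicts (2) are ignored, so the inner iteration reduces to repeatedly raising potentials until constraints (1) and (3) hold, and divergence is detected simply as a potential exceeding the assumed span. The names `relax` and `find_min_span`, the tuple format `(writer, reader, delay)` and the example are our own assumptions.

```python
def relax(rts, forward, looping, dp):
    """Inner iteration: raise potentials until (1) and (3) hold for a given span dp.

    Returns the potentials, or None when some potential exceeds dp (divergence).
    """
    p = {rt: 0 for rt in rts}
    changed = True
    while changed:
        changed = False
        for a, b, delay in forward:            # constraint (1)
            if p[b] < p[a] + delay:
                p[b], changed = p[a] + delay, True
        for a, b, delay in looping:            # constraint (3)
            bound = p[a] + delay - (dp + 1)
            if p[b] < bound:
                p[b], changed = bound, True
        if max(p.values(), default=0) > dp:
            return None
    return p

def find_min_span(rts, forward, looping):
    """Outer iteration: incremental search for the smallest feasible dp."""
    upper = sum(delay for _, _, delay in forward + looping)
    lower = max(relax(rts, forward, [], upper).values(), default=0)
    for dp in range(lower, upper + 1):
        p = relax(rts, forward, looping, dp)
        if p is not None:
            return dp, p
    return None

# Hypothetical loop body A -> B -> C, with C feeding A in the next iteration.
print(find_min_span(["A", "B", "C"],
                    [("A", "B", 1), ("B", "C", 1)],
                    [("C", "A", 4)]))
```

In this example the non-repetitive schedule suggests a span of 2, but the looping precedence with delay 4 makes every span below 5 diverge, so the search stops at a span of 5.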

An example is given in Fig. 2. With our algorithm, the optimal schedule 
has been found in numerous designs. Applications include a 4-processor pitch- 
extractor for speech, an adaptive interpolator for error correction in Compact 
Disc, and an echo canceller for digital telephony. Some results are listed in 
Table 1.

Table 1. Design figures for ATOMICS scheduler.



2.4 Automated hardware-assignment during scheduling

A designer often might wish to use multiple instances of a certain hardware 
operator in the data path (e.g. ALUs, multipliers,...). The decision on the num- 
ber of instances of each operator will be termed the hardware allocation. When 
multiple instances of an operator have been allocated, the next step in the de- 
sign is to bind each RT to one or more specific instances. This problem will 
be termed the hardware assignment. Below, an optimized hardware-assignment 
technique will be introduced, which is based on the scheduling algorithm. Full 
details can be found in [7]. For a given allocation, ATOMICS will try to find 
an assignment which minimizes the machine-cycle count. For this purpose, an
RT input-description has to be provided in which all RTs are formally using
differently named operator-instances, called formal instances. ATOMICS will
then merge these formal instances into the actual number of instances, specified
in the allocation. Automated hardware-assignment is non-trivial with architectures
in which single RTs may cover multiple operators. E.g. in CATHEDRAL II
data paths, register-files are placed locally at the inputs of operators [3]; therefore
the assignment of each RT requires choosing both a source operator (for
fetching and modifying source data) and a destination operator (for storing the
result). In general, the assignment of RTs involved in a data-precedence, cannot
be performed independently of each other. This fact encumbers the locality of
the search process.

Figure 2. Scheduling of repetitive program, obtained from Fig. 1, by adding looping precedences G-A
and out - in, with delays 4 and 1 respectively : (a) 1st outer iteration step, consisting of 1 inner iteration
step, with $\Delta p = 5$ (initial estimate based on non-repetitive schedule); (b) 2nd outer iteration step,
consisting of 2 inner iteration steps, with $\Delta p = 6$ (incremental search). (The selection index is a modified
critical-path measure.)

In order to combine hardware assignment with scheduling, two types of resource-allocation
conflicts have to be distinguished : soft conflicts, occurring between
RTs covering different formal instances of an operator; and (classical) hard conflicts,
occurring between RTs sharing the same actual instance of an operator.
RTs which are involved in a soft conflict, may be scheduled on the same potential,
as long as sufficient actual operator-instances are available to execute them
in parallel. In this case, the final assignment should obey the condition that the
formal instances which caused the soft conflict, may not be merged into one
actual instance. Such a condition is termed a merging constraint.

As in section 2.2, ATOMICS will process consecutive potentials, starting from 
0. At every potential, with the help of the critical path criterion, a maximal 
subset of RTs is derived from the waiting list, which contains a) no hard conflicts 
and b) only soft conflicts that don’t introduce an inconsistent (unsolvable) set 
of merging constraints for the given allocation. A set of merging constraints 
can be represented as arcs in an undirected graph, termed merging graph, with 
formal operator-instances as vertices. The problem of checking the consistency 
of a set of merging constraints at every potential [7] is equivalent to finding an 
acceptable vertex colouring [5] of the merging graph, in which the number of
colours corresponds to the number of allocated operators. Although vertex
colouring is NP-complete, the limited size of the merging graph usually allows
a simple branch and bound algorithm to be applied [7]. Every acceptable vertex
colouring, of the merging graph obtained after completion of the scheduling 
operation, provides a valid hardware-assignment. The cheapest colouring in 
terms of interconnection cost can then be chosen. 
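
The consistency check on the merging graph can be sketched as a small backtracking search in Python. This is a generic vertex-colouring routine with a simple symmetry bound, given for illustration under our own assumptions; it is not the branch and bound algorithm of [7], and the function name, argument format and instance names are hypothetical.

```python
def colour_merging_graph(instances, constraints, n_actual):
    """Map each formal instance to an actual instance index (a colour).

    Returns None when the merging constraints are inconsistent with
    the allocation of n_actual operator instances.
    """
    neighbours = {v: set() for v in instances}
    for u, v in constraints:
        neighbours[u].add(v)
        neighbours[v].add(u)
    # Colour the most constrained formal instances first.
    order = sorted(instances, key=lambda v: -len(neighbours[v]))
    colour = {}

    def assign(i):
        if i == len(order):
            return True
        v = order[i]
        used = {colour[u] for u in neighbours[v] if u in colour}
        # Bound: open at most one previously unused colour at this vertex.
        limit = min(n_actual, max(colour.values(), default=-1) + 2)
        for c in range(limit):
            if c not in used:
                colour[v] = c
                if assign(i + 1):
                    return True
                del colour[v]
        return False

    return dict(colour) if assign(0) else None

# Three formal ALUs that were all used in parallel cannot share two actual ALUs.
triangle = [("alu1", "alu2"), ("alu2", "alu3"), ("alu1", "alu3")]
print(colour_merging_graph(["alu1", "alu2", "alu3"], triangle, 2))
print(colour_merging_graph(["alu1", "alu2", "alu3"], triangle, 3))
```

A result of `None` signals that the soft conflict under consideration may not be accepted at the current potential; any returned mapping is one valid hardware assignment.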

3. Memory allocation 

When the RT-program has been scheduled, an allocation of variables to storage
fields can be carried out, aiming at a minimal number of fields. In ATOMICS,
this allocation is used to dimension register-files in the data path. The set of
potentials during which a variable is to be stored is called the variable’s lifetime
[1]. The lifetime of a variable in a non-repetitive program can be modelled as 
an integer interval (Fig. 3). In the repetitive case however, lifetimes may have 
to be defined as the union of non-overlapping intervals [6] (Fig. 4). One storage 
field may be allocated to different variables if they have disjoint lifetimes.

Figure 3. Lifetime intervals of variables in a register-file, an arbitrary (top) and an optimal
register-allocation (bottom).




Figure 4. Register-allocation for repetitive program.



The general memory allocation problem is equivalent to clique partitioning, 
which is NP-complete [8]. However, for the class of non-repetitive programs, 
a globally optimal allocation can be found in linear time, with the algorithm 
presented in Table 2. For an optimality proof, we refer to [6]. The analysis is 
demonstrated in Fig. 3 (bottom). 

This algorithm requires that lifetime intervals be continuous. For the class of
repetitive programs, ATOMICS uses an efficient heuristic procedure, incorporating
the previously described algorithm. This procedure consists of two steps :

■ Consider the variables with a lifetime composed of several non-overlapping
intervals. A heuristic graph-based algorithm [6] “fills the gaps”
in these lifetimes, with smaller continuous lifetime intervals of other variables.
Both the “filling” and the “filled” variables are assigned to the same
storage-field. In this way, discontinuous lifetimes are formally replaced
by single continuous intervals.

■ Next, the optimal allocation algorithm for non-repetitive programs is applied.

p := minimal potential;
create new storage-field f;
while there exist unassigned variables do
    if there exist unassigned variables v_i, coming alive at p then
        select arbitrary v_i;
        assign v_i to f;
        p := endvalue of v_i's lifetime
    else
        p := p + 1
    end if;
    if p >= maximal potential then
        p := minimal potential;
        create new storage-field f
    end if
end do.

Table 2. Optimal register-allocation for non-repetitive program.
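
The allocation loop of Table 2 can be sketched in Python as follows. The sketch is written in the style of a left-edge sweep and rests on our own assumptions: lifetimes are continuous, half-open intervals `(start, end)`, so that a field becomes free again at potential `end`, and the function name and example data are hypothetical. For brevity it re-scans the pending variables once per storage-field, so it does not reproduce the linear-time behaviour mentioned above.

```python
def allocate_fields(lifetimes):
    """Pack variables with continuous lifetimes into storage-fields.

    lifetimes maps a variable name to a half-open interval (start, end).
    Returns a list of fields, each a list of variable names.
    """
    pending = sorted(lifetimes.items(), key=lambda item: item[1][0])
    fields = []
    while pending:
        field, leftover = [], []
        p = float("-inf")                 # potential from which the field is free
        for name, (start, end) in pending:
            if start >= p:                # next variable coming alive at or after p
                field.append(name)
                p = end
            else:
                leftover.append((name, (start, end)))
        fields.append(field)
        pending = leftover
    return fields

# Hypothetical lifetimes; at most two variables are alive at any potential.
example = {"a": (0, 3), "b": (1, 4), "c": (3, 6), "d": (4, 7), "e": (6, 8)}
print(allocate_fields(example))
```

For this example two storage-fields suffice, which equals the maximum number of simultaneously alive variables.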

With this procedure we succeeded in finding the optimum in almost any practical 
design, in very low CPU-time. An example is given in Fig. 4. As has been 
observed in [8], direct application of general heuristics for clique partitioning 
often produces suboptimal solutions to the memory allocation problem. In Fig. 
5, a special configuration is shown, for which the results of the basic clique 
partitioning algorithm presented in [8] are compared with our technique. 




Figure 5. Example of register-allocation, comparing our technique (b) with the approach described in 
[8] (a). 



4. Conclusions 

The microcode compiler ATOMICS for custom DSP-systems has been pre- 
sented. The problems of scheduling (with an extension towards automated 
hardware binding) and memory allocation, have been emphasized. Efficient 
algorithms have been presented, which are partly based on heuristics, to com- 
pute quasi-optimal solutions in very low CPU-time. Loop-constructs in the ap- 
plicative input description are supported. Current research is aimed at further 
optimization of repetitive programs, e.g. by allowing overlaps in time between
successive loop-iterations. Furthermore, it is investigated how memory opti- 
mization can be taken into account during scheduling. 

Notes 

1. “A Tool for Optimized Micro-Instruction Compilation and Scheduling”.

Abstract

The increasing demand for “portable” computing and communication has elevated power consumption
to be the most critical design parameter. An automated high-level synthesis system,
HYPER-LP, is presented for minimizing power consumption in application specific datapath in- 
tensive CMOS circuits using a variety of architectural and computational transformations. The 
sources of power consumption are reviewed and the effects of architectural transformations on 
the various power components are presented. The synthesis environment consists of high-level 
estimation of power consumption, a library of transformation primitives (local and global), and 
heuristic/probabilistic optimization search mechanisms for fast and efficient scanning of the de- 
sign space. Examples with varying degree of computational complexity and structures are opti- 
mized and synthesized using the HYPER-LP system. The results indicate that an order of magni- 
tude reduction in power can be achieved over current-day design methodologies while maintain- 
ing the system throughput; in some cases this can be accomplished while preserving or reducing 
the implementation area. 



1. Introduction 

The major VLSI design and research efforts until now have been focused on 
optimizing speed to realize computationally intensive real-time tasks such as 
video compression and speech recognition. As a result, many systems have successfully
integrated various complex signal processing modules meeting users’
computation and entertainment demands. While these solutions have provided
answers to the real-time problem, they have not addressed the rapidly increasing
demand for portable operation. The strict limitation on power dissipation that
portability imposes must be met by the designer while still meeting ever higher
computational requirements.

An example of a system requiring portability, moving beyond today’s portable 
computers, is a future personal communications terminal described in [1] that 
will support speech communication and recognition, data transfer, computing 
services, and high-quality, full-motion video. The intense computational nature 
of the terminal functions coupled with the requirement of portability will place 
severe constraints on the total power being consumed. For example, a basic 
terminal with speech recognition and video decompression units implemented 
using current day technology will require about 20 lbs of battery for 10 hrs of operation
[1]. Clearly, more power efficient means for implementing these functions
need to be developed. One major degree of freedom available in optimizing
design for such applications is that once real-time operation is achieved, there is 
no advantage in making the computation any faster. The goal then becomes one 
of reducing the power consumption while maintaining the system throughput. 

Fortunately, the increasing density of VLSI systems, due to sub-micron fea- 
ture size scaling and high-density packaging such as multichip modules, has 
enabled the development of an architectural strategy which can be used to trade- 
off area and power for a fixed throughput [2]. 

In this work, we attack the problem of automatically finding computational
structures that result in the lowest power consumption for a specified throughput,
given a high-level algorithmic specification. The basic approach is to scan
the design space by utilizing various flowgraph transformations, high-level power 
estimation, and efficient heuristic/probabilistic search mechanisms. While trans- 
formations have been successfully applied recently in high-level synthesis with 
the goal of optimizing speed and/or area, the problem of power optimization has 
not been addressed. It will be shown that optimizing for power using transforma- 
tions requires a different strategy than those used for speed or area optimization. 

2. Sources of Power Consumption 

In CMOS technology, there are three sources of power dissipation arising 
from: switching (dynamic) currents, short-circuit currents, and leakage currents. 
The switching component, however, is the only one which cannot be made neg- 
ligible if proper design techniques are followed. The power consumption due to 
the switching of a CMOS gate with a load capacitor, $C_L$, is given by the following
formula [3]:

$P_{switching} = p_t (C_L \cdot V_{dd}^2 \cdot f)$ (1)

where $f$ is the clock frequency, $V_{dd}$ is the supply voltage, and $p_t$ is the probability
of a power consuming transition (or the activity factor). Probabilistic
approaches have been proposed to estimate the internal node activities of a network
given the distribution of the input signals (i.e. the probability that the value is
a “1” or “0”) [4, 5].

Figure 1. Plot of normalized delay vs. $V_{dd}$.

In our analysis and optimizations, we will refer to the energy per computation
of a gate or module (e.g. an adder), which is given by:

$Energy = P_{total} / f_{clk} = C_{avg} V_{dd}^2$ (2)

where $C_{avg}$ is the average capacitance being switched per clock cycle (i.e. $C_{avg} =
p_t \cdot C_{total}$).

The energy consumed by a logic block per computation is therefore a quadratic 
function of the operating voltage, as verified experimentally for a number of 
logic functions and logic styles in [2]. It is clear that operating at the lowest 
possible voltage is most desirable, however, this comes at the cost of increased 
delays and thus reduced throughput. This is seen from Figure 1 which shows 
an experimentally derived plot of normalized delay vs. Vdd for a typical CMOS 
gate. Once again, the delay dependence on supply voltage was verified to be 
relatively independent of various logic functions and logic styles [2]. 

By modifying the architecture through a variety of transformations, however, 
the throughput can be regained, and thus a power savings can be accomplished 
while retaining the required functionality. It is also possible to reduce the power 
by choosing an architecture that minimizes the effective capacitance, through
reductions in the number of operations, the average transition activity, the interconnect
capacitance, and internal bit widths, and by using operations that require
less energy per computation. It is these two strategies which will be pursued to
minimize the power dissipation. There is, however, a strong interaction between
optimizing capacitance and voltage for a fixed throughput. Many transformations
have conflicting effects on these parameters, making the optimization
a non-trivial task.

3. Transformations for Optimizing Power 

Transformations are changes to the computational structure in a manner that 
the input/output behavior is preserved. The use of transformations makes it pos- 
sible to explore a number of alternative architectures and to choose those which 
result in the lowest power. A brief summary of transformations for optimiz- 
ing power is presented in this section. A detailed discussion of the effects of 
transformations on power is presented in [6]. 

3.1 Critical Path Reduction 

This is probably the single most important type of transformation for power
reduction. It is not only the most common type of transformation, but also often 
has the strongest impact on power. The basic idea is to reduce the critical path, 
so that supply voltage can be lowered while keeping the throughput fixed. The 
reduction of critical path is most often possible due to the exploitation of con- 
currency. Many transformations profoundly affect the amount of concurrency in 
the computation including pipelining and loop unrolling. 

To illustrate the reduction of power using speedup techniques, consider a
module with capacitance $C$ running at a maximum frequency of $f$ at 5 V, so that
$P = C (5)^2 f$. By transforming this structure to a parallel architecture with
two identical units (unrolling), the clock frequency can be dropped to half the
original rate while maintaining the original throughput. Since the modules have
twice the available time as the original case, the voltage can be dropped to
2.9 V (where the delays increase by a factor of 2, Figure 1). The power of the
transformed solution is $P_{par} = 2C (2.9)^2 f/2$ and a factor of $(5/2.9)^2$ reduction
in power is achieved without sacrificing performance. The above analysis assumed
that there is no overhead for parallelizing. In reality, the overhead due
to routing and control must be taken into account. Even so, the quadratic dependence
of power on voltage usually more than compensates for the increase
in capacitance, resulting in an overall reduction of power. However, at very low
voltages (< 1.5 V), the delays (and hence the overhead circuitry) increase very
rapidly, causing the power to increase with further reduction of the supply voltage
[2].
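
The arithmetic of this example can be checked with a few lines of Python. The function name and the normalised values of capacitance and frequency are our own choices for illustration; only the voltages 5 V and 2.9 V and the doubling of the hardware come from the example above, and the parallelization overhead is ignored as in the text.

```python
def switching_power(capacitance, vdd, frequency):
    """P = C * Vdd^2 * f, with the activity factor folded into C."""
    return capacitance * vdd ** 2 * frequency

C, F = 1.0, 1.0  # normalised capacitance and clock frequency (arbitrary units)

p_ref = switching_power(C, 5.0, F)
# Two parallel units: twice the capacitance, half the clock, supply lowered to 2.9 V.
p_par = switching_power(2 * C, 2.9, F / 2)

reduction = p_ref / p_par
print(f"power reduction factor: {reduction:.2f}")
```

The factor evaluates to $(5/2.9)^2$, i.e. a reduction of roughly a factor of 3 before overhead is taken into account.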

3.2 Reducing the Number of Operations

The most obvious approach to capacitance reduction, is to reduce the num- 
ber of operations (and hence the number of switching events) in the data con- 
trol flow graph. While this almost always has the effect of reducing the effec- 
tive capacitance, the effect on critical path is case dependent. Transformations 
which directly reduce the number of operations in a data control flow graph 
include: common subexpression elimination, manifest expression calculation, 
loop merging, and distributivity. 

3.3 Reducing the Transition Activity 

Designs using static CMOS logic can exhibit spurious transitions due to fi- 
nite propagation delays from one logic block to the next (called critical races 
or dynamic hazards), i.e. a node can have multiple transitions in a single clock 
cycle before settling to the correct logic value. The amount of extra transitions 
is a complex function of logic depth, input patterns, and skew. To minimize the 
“extra” transitions and power in a design, it is important to balance all signal 
paths and reduce the logic depth. For example, consider the two implemen- 
tations for adding four numbers shown in Figure 2 (assuming a non-pipelined 
implementation). Assume that all primary inputs arrive at the same time. Since 
there is a finite propagation delay through the first adder for the chained case, 
the second adder is computing with the new C input and the previous output of 
A + B. When the correct value of A + B finally propagates, the second adder 
recomputes the sum. Similarly, the third adder computes three times per cycle. 
In the tree implementation, however, the signal paths are more balanced and the 
amount of extra transitions is reduced. The capacitance switched for a chained 
implementation is a factor of 1.5 larger than the tree implementation for a four 
input addition and 2.5 larger for an eight input addition. The above simulations 
were done on layouts generated by the LagerIV silicon compiler [7] using the
IRSIM [8] switch-level simulator over 1000 random input patterns. 
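
The recomputation argument can be illustrated with a crude counting model in Python. This model is our own simplification and not the switch-level simulation reported above: it assumes identical adders, simultaneous primary inputs, and that an adder with d adder levels on its longest input path evaluates d times per cycle. It counts evaluations only and does not model capacitance; the function names are hypothetical.

```python
def chain_evaluations(n_inputs):
    """Chained adders: the k-th adder sits at depth k and evaluates k times."""
    return sum(range(1, n_inputs))

def tree_evaluations(n_inputs):
    """Balanced tree (n_inputs a power of two): n/2 adders at depth 1, n/4 at depth 2, ..."""
    total, depth, adders = 0, 1, n_inputs // 2
    while adders >= 1:
        total += adders * depth
        adders //= 2
        depth += 1
    return total

for n in (4, 8):
    ratio = chain_evaluations(n) / tree_evaluations(n)
    print(n, chain_evaluations(n), tree_evaluations(n), round(ratio, 2))
```

Under these assumptions the chain performs 6 evaluations against 4 for the tree with four inputs, and 28 against 11 with eight inputs, which is of the same order as the simulated capacitance ratios quoted above.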




Figure 2. Reducing the glitching activity.

3.4 Reducing the Interconnect Capacitance

It is often possible to reduce the required amount of hardware, while preserv- 
ing the critical path (or number of control steps) [9]. This is possible because 
after certain transformations, operations are more uniformly distributed over the 
available time, and thus a denser scheduling (effective utilization of the hard- 
ware) can be achieved. These transformations include retiming for resource uti- 
lization, associativity, distributivity and commutativity. Smaller capacitance is 
achieved because there are fewer interconnects and/or fewer functional elements 
and registers, which are obstacles during floor-planning and routing, which in- 
directly influence interconnect area and capacitance. 

3.5 Operation Substitution 

Certain operations inherently require less energy per computation than other 
operations. A prime example of this is strength reduction, often used in software 
compilers, in which multiplications are replaced by additions.

Another powerful transformation in this category is converting multiplica- 
tions with constants into shift-add operations. Since multiplications with fixed 
coefficients are quite common in many signal processing applications (DCT, 
FFT, various types of filters, etc.), this transformation can prove to be beneficial. 
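
The conversion of a constant multiplication into shift-add operations can be sketched in Python as follows. This is a minimal illustration using the plain binary expansion of a non-negative integer constant; the function names are our own, and more economical encodings of the constant are not considered.

```python
def shift_add_terms(constant):
    """Bit positions of the 1s in a non-negative integer constant."""
    return [bit for bit in range(constant.bit_length()) if (constant >> bit) & 1]

def multiply_by_constant(x, constant):
    """Compute x * constant using only shifts and additions."""
    return sum(x << bit for bit in shift_add_terms(constant))

# 10 = 2^3 + 2^1, so 10 * x becomes (x << 3) + (x << 1).
print(shift_add_terms(10))
print(multiply_by_constant(7, 10))
```

Each 1 in the binary expansion of the coefficient costs one shift (free in hardwired form) and one addition, so coefficients with few non-zero bits benefit most.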

3.6 Bit-width Optimization 

The number of bits used can strongly affect all the key parameters of a design,
including speed, area and power. A smaller bit-width typically results in
fewer switching events (and hence lower capacitance), faster circuits (and hence 
lower supply voltage), and smaller area (and hence lower average interconnect 
length). Certain transformations (e.g. associativity and distributivity) can have 
a profound impact on the bit-width. 

4. Power Estimation 

The goal is to develop an objective function that is highly correlated to the 
final (and unknown) power dissipation of the circuit. The objective function 
should be very easy to compute since it has to be evaluated many times during 
the optimization process. Elaborate power estimation, while being much more 
accurate, will require the hardware mapping and compilation steps to convert 
a flowgraph to layout, making it impractical during the optimization process. 
Hence a model correlated to the power must be developed strictly from the flow- 
graph level. 

The goal of power optimization in this work is to keep the throughput con- 
stant by allowing the supply voltage to vary. Given that the sample period is 
fixed, power optimization is equivalent to minimizing the total energy switched. 
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Qotaiy^< where V is appropriate voltage required to meet the initial throughput 
rate. 



Estimating the total capacitance being switched involves considering four 
components: 



The capacitance estimation is built on top of an existing estimation routine in 
HYPER that determines the bounds and activity of various execution, register 
and interconnect components as well as the active implementations area [10]. A 
brief description of the estimation routines is presented below. 

4.1 Capacitance Estimate

4.1.1 Execution Units and Registers. The capacitance contributed by the
execution units is determined by multiplying, for each type of unit utilized,
the number of times the operation is performed per sample period by the
average capacitance of that unit type, and summing over all types. The total
capacitance is hence given by:

$$C_{exu} = \sum_{i=1}^{\mathit{numtypes}} N_i \cdot C_i \qquad (4)$$

where numtypes is the total number of operation types, $N_i$ is the number of
times the operation of type $i$ is performed per sample period, and $C_i$ is
the average capacitance switched per operation of type $i$. The average
capacitance for the various modules has been characterized as a function of
bit-width (through SPICE and IRSIM simulations) for a uniformly distributed
set of inputs. In general the probabilities are not uniform; however, this
assumption is made to simplify the cost evaluation. The effect of inter-module
capacitance (between modules inside a datapath) is taken into account by
incorporating an average loading capacitance during the characterization of
the leaf-cells. It is important to note that the contribution due to the
execution units is relatively independent of resource utilization (or the
degree of time-sharing) since the required number of operations must be
performed within the sample period. However, the amount of parallelism will
affect the interconnect capacitance.

Registers are treated the same way as the execution units. The number of
register accesses per sample period (read/write) is multiplied by the
capacitance per register access to yield a register contribution given by
$N_{registers} \cdot C_{register}$. Once again, while the total number of
registers is not important in calculating the register switching capacitance,
it will affect floorplanning and therefore the interconnect capacitance.
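The two estimates for execution units and registers can be sketched in a few lines. This is only an illustration of the sum-of-products form described above; the operation names, counts, and capacitance values are hypothetical and are not taken from HYPER's characterization data.

```python
# Sketch of the execution-unit and register capacitance estimates.
# All operation names and capacitance values below are hypothetical.

def execution_unit_capacitance(op_counts, avg_cap_per_op):
    """Sum over operation types of (operations per sample period) * (average capacitance)."""
    return sum(count * avg_cap_per_op[op] for op, count in op_counts.items())


def register_capacitance(accesses_per_sample, cap_per_access):
    """Register accesses per sample period times capacitance per access."""
    return accesses_per_sample * cap_per_access


op_counts = {"add": 12, "mul": 4}            # N_i, hypothetical
avg_cap_per_op = {"add": 2.0, "mul": 10.0}   # C_i in pF, hypothetical

c_exu = execution_unit_capacitance(op_counts, avg_cap_per_op)
c_reg = register_capacitance(accesses_per_sample=30, cap_per_access=0.5)
print(c_exu, c_reg)
```

Note that neither function takes the number of allocated units or registers as an input, which mirrors the observation that these contributions are largely independent of resource utilization.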
4.1.2 Interconnect. A relatively accurate model for interconnect capacitance
is important when performing power trade-offs, since the interconnect often
starts to dominate over the logic capacitance and restricts the improvement in
power that can be achieved. Determining the interconnect capacitance is a
difficult task since we have to, in essence, emulate partitioning, placement,
and routing.

The goal of interconnect estimation is to estimate the inter-block (between
macro blocks, such as between datapaths) routing capacitance. A statistical
model based approach is used to predict the inter-block capacitance from high
level parameters such as the number of global interconnects, active area and
bit-width. This model for estimating the average interconnect capacitance
switched requires the number of interconnects, $N$, the average activity,
$\alpha$, and the average interconnect length, $L$. The average interconnect
length is obtained from statistical estimates of the final routed chip area
and typical interconnect length distributions as a function of area [11]. The
interconnect capacitance is then estimated as:

$$C_{interconnect} = C_L \cdot N \cdot L \cdot B \cdot \alpha \qquad (5)$$

where $C_L$ is the capacitance per unit length and $B$ is the bit-width. A more
extensive model for the interconnect is currently being developed.

4.1.3 Control Logic. From several circuits, it was observed that 
there is a strong correlation between the control capacitance switched and the 
total relevant capacitance (muxes, tri-states, registers, and any other module that 
requires control signals). Based on this information, a simple model is used 
to predict the total control capacitance as a function of high-level parameters. 
Notice that control contribution will be a function of the architecture style used. 

4.2 Supply Voltage Estimation 

We are interested in computing the power supply voltage at which the
transformed flowgraph will meet the timing constraints. The initial flowgraph,
which meets the timing constraints, is typically assumed to be operating at a
supply voltage of 5V with a critical path of $T_{initial}$ (the initial voltage
will be lower if $T_{initial} < T_{sampling}$). After each move, the critical
path is re-estimated, and the new supply voltage at which the transformed
flowgraph still meets the time constraint, $T_{sampling}$, is determined. For
example, if the initial solution requires 10 control steps running at a supply
voltage of 5V, then a transformed solution that requires only 5 control steps
can run at a supply voltage of 2.9V while meeting the throughput constraint.
This relationship (of delay versus $V_{dd}$) was modelled using Neville’s
algorithm for rational function interpolation and extrapolation [12].

5. Optimization Algorithm

The transformation mechanism is based on two types of moves, global and local.
While global moves optimize the whole DCFG simultaneously, local moves involve
applying a transformation only on one or very few nodes in the DCFG. The most
important advantage of global moves is, of course, a higher optimization
effect; the advantages of local moves are their simplicity and small
computational cost. We used the following global transformations: (i) retiming
and pipelining for critical path reduction, (ii) associativity, (iii) constant
elimination, and (iv) loop unrolling. In the library of local moves we have
implemented three algebraic transformations: associativity, distributivity and
commutativity.

The computational complexity analysis of the power minimization problem showed
that even highly simplified versions of the optimization tasks are
NP-complete. Two widely used alternatives for the design of high quality
suboptimal optimization algorithms are probabilistic and heuristic algorithms.
Each has distinctive advantages over the other. While the most important
advantage of heuristic algorithms is a shorter run time, probabilistic
algorithms are more robust and have stronger mechanisms for escaping local
minima.

The algorithm for power minimization using transformations has both heuristic
and probabilistic components. While the heuristic part uses global
transformations, the probabilistic component uses local moves. The heuristic
part applies global transformations one at a time in order to provide good
starting points for the application of the probabilistic algorithm. The
probabilistic algorithm conducts a probabilistic search in a broad vicinity of
the solution provided by the heuristic part. The underlying search mechanism
of the probabilistic part is simulated annealing.
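The probabilistic stage can be pictured with a generic simulated annealing loop. The sketch below is not the HYPER-LP implementation: the cooling schedule, the parameter values, and the toy design representation (an integer with a made-up cost) are all assumptions chosen only to show how local moves are accepted or rejected, starting from a solution such as the one the heuristic stage would supply.

```python
import math
import random


def anneal(start, cost, random_local_move, t_start=10.0, t_end=0.01,
           cooling=0.9, moves_per_temp=100, seed=0):
    """Generic simulated annealing over local moves; returns (best, best_cost)."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            candidate = random_local_move(current, rng)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept improvements; accept uphill moves with
            # probability exp(-delta / t) so the search can leave local minima.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= cooling  # geometric cooling schedule (an assumption)
    return best, best_cost


# Toy stand-in for a design: an integer with a made-up cost function.
def toy_cost(x):
    return (x - 7) ** 2


def toy_move(x, rng):
    return x + rng.choice((-1, 1))


start = 0  # stands in for the solution produced by the heuristic (global) stage
best, best_cost = anneal(start, toy_cost, toy_move)
print(best, best_cost)
```

In the setting described here, the cost function would be the estimated power and each local move would be one of the algebraic transformations applied to a node of the DCFG.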

6. Results 

A summary of the power improvement after applying transformations, relative to
an initial solution that met the required throughput constraint at 5V, is
shown in Table 1 for several representative examples. The results indicate
that a large reduction in power consumption is possible (at the expense of
area) compared to present-day methodologies. Also interesting was the fact
that the “optimal” final supply voltage for all the examples was much lower
than existing and emerging standards and was around 1.5V.

Example             Power Reduction   Area Increase
RGB -> YUV          8                 5
FIR Filter          11                1.1
DCT (8 point)       8                 5
Speech Filter       8                 6.4
Elliptical Filter   9                 2.7
Wavelet Filter      10                2
Volterra Filter     8.6               1

Table 1. Summary of results.

7. Conclusions

Power minimization is becoming a very important problem with the increasing
demand for “portable” computing and communication. We have presented a
high-level synthesis system, HYPER-LP, for optimizing power consumption in
application specific datapath intensive circuits using a variety of
architectural and computational transformations. The synthesis approach
consisted of applying transformation primitives (from a library of local and
global moves) in a well defined manner in conjunction with efficient
high-level estimation of power consumption. The results indicate that an order
of magnitude reduction in power is possible over current-day design
methodologies while maintaining the system throughput. It was also found that
the optimal supply voltage for minimizing power was much lower than existing
standards (present-day 5V and emerging 3.3V) and was around 1.5V for most of
the examples investigated. While this work has addressed some key problems in
the automated design of low-power systems, there are still many open research
problems, such as detailed power estimation, module selection, partitioning,
and scheduling for power optimization.
Abstract

Embedded computer systems are characterized by the presence of a dedicated processor and the 
software that runs on it. Power constraints are increasingly becoming the critical component of 
the design specification of these systems. At present, however, power analysis tools can only 
be applied at the lower levels of the design - the circuit or gate level. It is either impractical or 
impossible to use the lower level tools to estimate the power cost of the software component of 
the system. This paper describes the first systematic attempt to model this power cost. A power 
analysis technique is developed that has been applied to two commercial microprocessors - Intel 
486DX2 and Fujitsu SPARClite 934. This technique can be employed to evaluate the power cost 
of embedded software and also be used to search the design space in software power optimization. 



1. Introduction 

Embedded computer systems are characterized by the presence of a dedicated 
processor which executes application specific software. Recent years have seen 
a large growth of such systems. This growth is driven by several factors. The 
first is an increase in the number of applications as illustrated by the numerous 
examples of “smart electronics” around us. The second factor leading to their 
growth is the increasing migration from application specific logic to application 
specific code running on existing processors. The migration to software pro- 
grammable solutions can often provide the competitive edge in terms of lower 
manufacturing costs and shorter time to market. Thus, we are seeing a move- 
ment from the logic gate being the basic unit of computation on silicon, to an 
instruction running on an embedded processor. 

A large number of embedded computing applications are power critical, i.e., 
power constraints form an important part of the design specification. While there 
has been a significant research effort in power estimation and low power design, 
there is very little available in the form of design tools to help embedded system 
designers evaluate their designs in terms of the power metric. At present, power 
measurement tools are available for only the lower levels of the design - at the 
circuit level and the gate level. At best these are very slow and impractical
to use to evaluate the power consumption of software, and often they cannot even
be applied due to lack of availability of circuit and gate level information of
the embedded processors. The embedded processors currently used in designs
take two possible shapes. The first is “off the shelf” microprocessors or digital
signal processors (DSPs). The second is in the form of cores embedded in larger 
integrated circuits. In the first case, the processor information available to the 
designer is whatever is made available through data books. In the second case, 
the designer has logic/timing simulation models to help verify the designs. In 
neither case is there lower level information available for power analysis. 

This paper describes a power analysis technique for embedded software. The 
goal is to develop and validate an instruction level power model for embedded 
software. Such a model can then be provided by the processor vendors for both 
off the shelf processors as well as embedded cores. This can then be used to 
evaluate embedded software, much as a gate level power model has been used 
to evaluate logic designs. This is useful in its own right to verify that a design 
meets its specified power constraints. In addition, it can also be used to search 
the design space in software power optimization. The technique has so far been 
applied to two commercial microprocessors - the Intel 486DX2 and the Fujitsu 
SPARClite 934. This paper uses the former as a basis for illustrating the tech- 
nique. The application of this technique for the latter is described in a separate 
reference [4]. 

2. Experimental Method 

While it is recognized that the power consumption of a processor varies from 
program to program, there is a complete lack of models and tools to analyze this 
variation. Traditional attempts to model the power consumption in the CPU rely 
on detailed physical layout of the processor and sophisticated power analysis 
tools that use the information provided by the layout. 

In the case of embedded system design, detailed layout information of the 
CPU is often not available. Even if it is available, these techniques are ex- 
pensive and difficult to apply. This is also the reason why the potential for 
power reduction through modification of software is so far unknown and unex- 
ploited. The thrust of our work is to overcome these deficiencies by developing a 
power estimation methodology based on actual laboratory measurements. Given 
a measurement setup to measure the current being drawn by the microprocessor, 
the only other information required can be obtained from the widely available 
manuals and handbooks specific to that microprocessor. 

The main idea is to formulate an instruction level power model for the micro- 
processor. Given this model and an assembly/machine level program, the power 
consumption in the program can be efficiently estimated. The specifics of the 
measurement methodology are described next. 
2.1 Power and Energy

The average power consumed by a microprocessor while running a certain
program is given by $P = I \times V_{CC}$, where $P$ is the average power, $I$
is the average current and $V_{CC}$ is the supply voltage. Since power is the
rate at which energy is consumed, the energy consumed by a program is given by
$E = P \times T$, where $T$ is the execution time of the program. This in turn
is given by $T = N \times \tau$, where $N$ is the number of clock cycles taken
by the program and $\tau$ is the clock period.

In common usage, the terms power consumption and energy consumption 
are often interchanged. However it is important to distinguish between the two 
when we talk of either of these in the context of programs running on mobile 
applications. Mobile systems run on the limited energy available in a battery. 
Therefore the energy consumed by the system or by the software running on it 
determines the length of the battery life. Energy consumption is thus the focus 
of attention. We will attempt to maintain a distinction between the two in the 
rest of the paper. However, in certain cases the term power may be used to refer 
to energy, in adherence to common usage. 

2.2 Current Measurement 

For this study, the processor used was a 40MHz Intel 486DX2-S Series CPU. 
The CPU was part of a mobile personal computer evaluation board with 4MB of 
DRAM memory. The reason for the choice of this processor was that its board 
setup allowed the measurement of the CPU and DRAM subsystem current in 
isolation from the rest of the system. We would like to emphasize that while the 
numbers we report here are specific to this processor and board, the methodol- 
ogy used by us in developing the model is widely applicable. The current was 
measured through a standard off the shelf, dual-slope integrating digital amme- 
ter. 

If a program completes execution in a short time, a current reading cannot be 
obtained visually. To overcome this, the programs being considered were put in 
infinite loops and current readings were taken. The current consumption in the 
CPU will vary in time depending on what instructions are being executed. But 
since the chosen ammeter averages current over a window of time (100ms), if 
the execution time of the program is much less than the width of this window, a 
stable reading will be obtained. 

The main limitation of this approach is that it will not work for programs with 
larger execution times since the ammeter may not show a stable reading. How- 
ever, in this study, the main use of this approach was in determining the current 
drawn while a particular instruction (instruction sequence) was being executed. 
A program written with several instances of the targeted instruction (instruction 
sequence) executing in a loop, has a periodic current waveform which yields a 
steady reading on the ammeter. This inexpensive approach works very well for 
this. However, the main concepts described in this paper are independent of the
actual method used to measure average current. If sophisticated data acquisition 
based measurement instruments are available, the measurement method can be 
based on them. 

For our setup, $V_{CC}$ was 3.3V and $\tau$ was 25ns, corresponding to the 40MHz
internal frequency of the CPU. Given these constants, energy is proportional to 
the average current and number of cycles. Unless otherwise stated, the numbers 
reported in this paper correspond to average current in mA. 
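As a worked restatement of the relations in Section 2.1 with these constants (a derived convenience form, not an equation from the original text), the energy in nanojoules follows directly from the reported average current in mA and the cycle count:

```latex
E = I \times V_{CC} \times N \times \tau
  = I \times N \times (3.3\,\mathrm{V}) \times (25\,\mathrm{ns})
\quad\Longrightarrow\quad
E\,[\mathrm{nJ}] = 0.0825 \times I\,[\mathrm{mA}] \times N
```

For example, one cycle at an average current of 300 mA corresponds to 24.75 nJ.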

3. Instruction Level Modeling 

A modern microprocessor like the 486DX2 is an extremely complex system
consisting of several interacting functional blocks. However, this internal
complexity is hidden behind a simple interface - its instruction set. Thus, to
model the energy consumption of this complex system, it seemed intuitive to
consider individual instructions. Each instruction involves specific
processing across various units of the CPU. This can result in circuit
activity that is characteristic of each instruction and can vary with
instructions.

This intuition was the starting point for the empirical study that led to the
development of the final instruction-level energy model. Under this model each
instruction in the instruction set is assigned a fixed energy cost called the
base energy cost. The variation in base costs of a given instruction due to
different operand and address values is then quantified. The base energy cost
of a program is based on the sum of the base energy costs of each executed
instruction. However, during the execution of a program, certain
inter-instruction effects occur whose energy contribution is not accounted for
if only base costs are considered. The first type of inter-instruction effect
is the effect of circuit state. The second type is related to resource
constraints that can lead to stalls and cache misses. The energy cost of these
effects is also modeled and used to obtain the total energy cost of a program.

The instruction-level energy model described here is based on actual
measurements and evolved as a result of extensive experimentation. It is
comprehensive and provides all the information needed to evaluate programs in
terms of their energy costs. The various components of this model are
described in the subsections below.

3.1 Base Energy Cost

The base cost for an instruction is determined by constructing a loop with
several instances of the same instruction. The average current being drawn is
then measured. This current multiplied by the number of cycles taken by each
instance of the instruction is proportional to the total energy as described
in Section 2.
[Figure 1: pipeline stages 1 to 5: Instruction Fetch, Decode 1, Decode 2, Execution, Register Write Back]

Figure 1. Internal Pipelining in the 486DX2



While this method seems intuitive if the CPU is executing only one instruction
at a given time, most modern CPUs, including the 486DX2, process more than one
instruction at a given time due to pipelining. However, the following
discussion shows that the concept of a base energy cost per instruction and
its derivation remain unchanged.



Number   Instruction        Base Cost (mA)   Cycles
1        NOP                275.7            1
2        MOV DX,BX          302.4            1
3        MOV DX,[BX]        428.3            1
4        MOV DX,[BX][DI]    409.0            2
5        MOV [BX],DX        521.7            1
6        MOV [BX][DI],DX    451.7            2
7        ADD DX,BX          313.6            1
8        ADD DX,[BX]        400.1            2
9        ADD [BX],DX        415.7            3
10       SAL BX,1           300.8            3
11       SAL BX,CL          306.5            3
12       LEA DX,[BX]        364.4            1
13       LEA DX,[BX][DI]    345.2            2
14       JMP label          373.0            3
15       JZ label           375.7            3
16       JZ label           355.9            1
17       CMP BX,DX          298.2            1
18       CMP [BX],DX        388.0            2

Table 1. Subset of the Base Cost Table for the 486DX2



The 486DX2 CPU has a five-stage pipeline [1] as shown in Figure 1. Let
$E_{j|I_k}$ be the average energy consumed by pipeline stage $j$ when
instruction $I_k$ executes in that stage. Pipeline stages are separated from
each other by latches. Thus, if we ignore the effect of circuit state and
resource constraints for now, the energy consumption of different stages is
independent of each other. Let us assume that in a given cycle, instruction
$I_1$ is being processed by stage 1, $I_2$ by stage 2, and so on. The total
energy consumed by the CPU in that cycle would be
$E_{cycle} = E_{1|I_1} + E_{2|I_2} + E_{3|I_3} + E_{4|I_4} + E_{5|I_5}$. On the
other hand, the total energy consumed by a given instruction $I_1$ as it moves
through the various stages is $E_{I_1} = \sum_j E_{j|I_1}$. This quantity
actually refers to the base cost in the sense described above. Our method of
forming a loop of instances of instruction $I_1$ results in
$E_{cycle} = E_{I_1}$, since in that case $I_1 = I_2 = I_3 = I_4 = I_5$. The
average current in this case is $\sum_j E_{j|I_1} / (V_{CC} \times \tau)$,
which is the same as the ammeter reading obtained.

data          0       0F      0FF     0FFF    0FFFF
No. of 1's    0       4       8       12      16
Base Cost     309.5   305.2   300.1   294.2   288.5

Table 2. Base Costs of MOV BX, data

Some instructions take multiple cycles in a given pipeline stage. All stages
are then stalled. The reasoning applied above, however, remains unchanged. The
base energy cost of the instruction is just the observed average current value
multiplied by the number of cycles taken by the instruction in that stage. For
instance, consider a loop of instruction $I_1$, where $I_1$ takes $m$ cycles in
the 4th stage. Therefore, $E_{4|I_1}$ is spread over $m$ cycles. For the sake
of brevity, assume that each stalled stage consumes zero energy. Then the
current value observed on the ammeter will be
$\sum_j E_{j|I_1} / (V_{CC} \times \tau \times m)$. This quantity multiplied by
$m$ yields $\sum_j E_{j|I_1} / (V_{CC} \times \tau)$, the base energy cost of
the instruction. $m$ represents the “number of cycles” parameter specified in
instruction timing tables in microprocessor manuals.

Table 1 is a sample table of CPU base costs for some instructions for the
486DX2.* The numbers in Column 3 are the base cost in mA per clock cycle.
The overall base energy cost of an instruction is the product of the numbers in
Columns 3 and 4 and the constants $V_{CC}$ and $\tau$.
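The product described above can be written out as a small lookup. This is a minimal sketch: the table entries are copied from Table 1 and the constants are the 3.3V and 25ns values stated for this setup, while the function and dictionary names are hypothetical.

```python
# Base energy cost = (base cost in mA) * (cycles) * V_CC * tau.
V_CC = 3.3      # supply voltage in volts, as stated for this setup
TAU = 25e-9     # clock period in seconds (40MHz internal clock)

# (base cost in mA per cycle, cycles), copied from Table 1
BASE_COSTS = {
    "NOP": (275.7, 1),
    "MOV DX,BX": (302.4, 1),
    "ADD DX,[BX]": (400.1, 2),
    "ADD [BX],DX": (415.7, 3),
}


def base_energy_joules(instruction):
    """Return the base energy cost of one instruction in joules."""
    current_ma, cycles = BASE_COSTS[instruction]
    return current_ma * 1e-3 * cycles * V_CC * TAU


for name in BASE_COSTS:
    print(f"{name:12s} {base_energy_joules(name) * 1e9:7.2f} nJ")
```

Because V_CC and tau are fixed, comparing instructions by the product of Columns 3 and 4 gives the same ranking as comparing them by energy.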

Care should be taken in designing the experiments used to determine the base 
costs. The size of the loop has to be large enough to minimize the effects of the 
branch statement at the bottom of the loop and small enough not to cause any 
cache misses. Only the target instructions should execute on the CPU during the 
experiment and thus system effects like multiple time-sharing applications and 
frequent interrupts cannot be allowed. 

3.1.1 Variations in Base Cost. As Table 1 shows, instructions 
with differing functionality and different addressing modes can have very dif- 
ferent energy costs. This is to be expected since different functional blocks are 
being affected in different ways by these instructions. Within the same fam- 
ily of instructions, there is variability in base costs depending on the value of 
operands used. For example, consider the MOV register, immediate family. Use of
different registers results in insignificant variation since the register file
is probably a symmetric structure. Variation in the immediate value, however,
leads to measurable variation. As an example, Table 2 shows the variation for
MOV BX, immediate. The costs seem to be almost a linear function of the number
of 1's in the binary representation of the immediate data: the more 1's, the
lower the cost. Similarly, for the ADD instruction, the base costs are a
function of the two numbers being added. The range of variation in all cases,
however, is small. It is observed to be about 14 mA, which corresponds to less
than a 5% variation.

For instructions involving memory operands, there is a variation in the base 
cost depending upon the address of the operand. The variation is of two kinds. 
The first is due to operands that are mis-aligned [1]. Mis-aligned accesses lead to 
cycle penalties and thus energy penalties that are added to the base cost. Within 
aligned accesses there is variation in the base cost depending upon the value of 
the address. For example, for MOV DX,[BX], the base cost can be greater than
the cost shown in Table 1 by about 3.5%. This variation is a function of the
number of, and position of, 1's in the binary representation of the address.

Given the operand value and address, exact base costs can be obtained through 
direct measurements. However, these exact values will be of little use since typ- 
ically a data or address value can be known only at run-time. Thus, from the 
point of view of program energy cost estimation, the only alternative is to use 
average base cost values. This is reasonable given that the variation in base costs 
is small and thus the discrepancy between the average and actual values will be
limited. 



3.2 Inter-instruction Effects 

When sequences of instructions are considered, certain inter-instruction ef- 
fects come into play, which are not reflected in the cost computed solely from 
base costs. These effects are discussed below. 

3.2.1 Effect of Circuit State. The switching activity in a circuit 
is a function of the present inputs and the previous state of the circuit. Thus, 
it can be expected that the actual energy cost of executing an instruction in a 
program may be different from the instruction’s base cost. This is because the 
previous instruction in the given program and in the program used for base cost 
determination may be different. For example, consider a loop of the following 
pair of instructions: 

XOR BX,1
ADD AX,DX

The base costs of the XOR and ADD instructions are 319.2 and 313.6. The
expected base cost of the pair, using the individual base costs, would be their
average, i.e. 316.4, while the actual current is 323.2. It is greater by 6.8.
The reason is that the base costs are determined while executing the same
instruction again and again. Thus each instruction executes in what we expect
is a context of least change. At least, that is what the observations
consistently seem to indicate. When a pair of two different instructions is
considered, the context is one of greater change. The cost of a pair of
instructions is always greater than the base cost of the pair, and the
difference is termed the circuit state overhead.

As another example, consider the following sequence of instructions. The
base cost and the number of cycles of each instruction are listed alongside:

Number   Instruction     Base Cost   Cycles
1        MOV CX,1        309.6       1
2        ADD AX,BX       313.6       1
3        ADD DX,8[BX]    400.2       2
4        SAL AX,1        308.3       3
5        SAL BX,CL       306.5       3

The measured cost was 332.8 (avg. current over 10 cycles). Using base costs
we get

$$(309.6 + 313.6 + 400.2 \times 2 + 308.3 \times 3 + 306.5 \times 3)/10 = 326.8 \qquad (1)$$

The circuit state overhead is thus 6.0.

It is possible to get a closer estimate if we consider the circuit state overhead 
between each pair of consecutive instructions. This is done as follows. Consider 
a loop of the targeted pair, e.g., instructions 2 and 3. The estimated cost for the 
pair is (2 x 400.2 + 313.6 x l)/3 = 371.3 , while the measured cost is 374.8. 
Thus, the circuit state overhead is 3.5. Now the overhead occurs twice in every 
3 cycles, once between instructions 2&3, and once between 3&2. Since these
two different cases cannot be resolved, let us assume that they are the same.
Thus, the overhead each time it occurs would be $3.5 \times \frac{3}{2} = 5.25$. Similarly, the
overhead between the pairs 1&2, 3&4, 4&5 and 5&1 is found to be 17.9, 12.25, 
3.3 and 17.2 respectively. When these overheads are added to the numerator in 
Equation 1, we get an estimated cost of 332.38, which is within 0.12% of the 
measured value. 
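The arithmetic of this example can be reproduced with a short script. It is a sketch using only the numbers quoted above; the variable and function names are hypothetical. With the pair overheads as listed, the refined estimate evaluates to about 332.39, marginally different from the 332.38 quoted, presumably because the quoted overheads are rounded.

```python
# (instruction, base cost in mA, cycles) for the five-instruction loop
SEQUENCE = [
    ("MOV CX,1", 309.6, 1),
    ("ADD AX,BX", 313.6, 1),
    ("ADD DX,8[BX]", 400.2, 2),
    ("SAL AX,1", 308.3, 3),
    ("SAL BX,CL", 306.5, 3),
]
# Circuit state overheads for pairs 1&2, 2&3, 3&4, 4&5 and 5&1, as quoted
PAIR_OVERHEADS = [17.9, 5.25, 12.25, 3.3, 17.2]


def per_occurrence_overhead(measured, estimated, loop_cycles, occurrences):
    """Spread a measured-minus-estimated average current over each occurrence."""
    return (measured - estimated) * loop_cycles / occurrences


total_cycles = sum(cycles for _, _, cycles in SEQUENCE)
weighted_sum = sum(cost * cycles for _, cost, cycles in SEQUENCE)

base_estimate = weighted_sum / total_cycles
refined_estimate = (weighted_sum + sum(PAIR_OVERHEADS)) / total_cycles
pair_2_3 = per_occurrence_overhead(374.8, 371.3, loop_cycles=3, occurrences=2)

print(round(base_estimate, 2), round(refined_estimate, 2), round(pair_2_3, 2))
```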

This example illustrates that by determining costs of pairs of instructions, it is 
possible to improve upon the results of the estimation obtained with base costs 
alone. However, extensive experiments with pairs of instructions revealed that 
the circuit state overhead has a limited range - between 5.0 and 30.0 and most 
frequently occurred in the vicinity of 15.0. This motivates an efficient yet fairly 
accurate way to account for the circuit state overhead. Calculate the average 
current for the program using the base costs. Then, add 15.0 to it, to account for 
circuit state overhead. 
A specific manifestation of the effect of circuit state is the effect of switching
that occurs on address and data lines. Our experiments revealed that the overall
impact of this effect was small. For data reads from the cache, greater switching
of the address values led to at most a 2% increase in the energy cost, while
for data writes (which go to the cache and the memory bus), the overhead due
to greater switching was less than 4%.

The limited variation in the circuit state overhead is contrary to popular
belief. In fact, a recent work [3] talks about scheduling instructions to
reduce this overhead. But as our experiments reveal, the methods described in
this work will not have much impact for the 486DX2. The probable explanation
for the limited variation in circuit state overhead is that a major part of the
circuit activity in a complex processor like the 486DX2 is common to all
instructions, e.g., instruction pre-fetch, pipeline control, clocks etc. While
the circuit state
may cause significant variation within certain modules, its impact on the overall 
energy cost is swamped by the much greater common cost. However, we would 
not like to rule out the impact of circuit state overhead for all processors. It 
may well be that it is a significant part of the energy consumption in DSPs and 
processors with complex power management features. An investigation of this 
issue is the subject of our future study. 

3.2.2 Effect of Resource Constraints. Resource constraints in the CPU can
lead to stalls, e.g., pipeline stalls and write buffer stalls [1,2]. These
can be considered as another kind of inter-instruction effect. They cause an
increase in the number of cycles needed to execute a sequence of instructions.
For example, a sequence of 120 MOV DX,[BX] instructions takes about 164 cycles
to execute, instead of 120, due to pre-fetch buffer stalls. While determining
the base cost of instructions, it is important to avoid stalls, since they
represent a condition that ought not to be reflected in the base cost. Thus,
for MOV DX,[BX] a sequence consisting of 3 MOV instructions followed by a NOP
is used, since there are no stalls during its execution [2]. Knowing the cost
of the NOP and the measured value for the sequence, the base cost of the MOV is
determined.
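Backing the MOV cost out of the padded sequence amounts to removing the NOP's share from the measured average. The sketch below assumes the average is a cycle-weighted mean and ignores any circuit state overhead between the MOV and the NOP. The measured value of 390.15 is not reported in the text; it is back-computed from the Table 1 entries purely for illustration.

```python
def base_cost_from_padded_loop(measured_avg, n_target, target_cycles,
                               nop_cost, nop_cycles=1):
    """Recover the per-cycle base cost of the target instruction from a loop
    of n_target target instructions followed by one NOP."""
    total_cycles = n_target * target_cycles + nop_cycles
    target_total = measured_avg * total_cycles - nop_cost * nop_cycles
    return target_total / (n_target * target_cycles)


NOP_COST = 275.7           # Table 1
ILLUSTRATIVE_AVG = 390.15  # hypothetical reading, back-computed from Table 1

mov_cost = base_cost_from_padded_loop(ILLUSTRATIVE_AVG, n_target=3,
                                      target_cycles=1, nop_cost=NOP_COST)
print(round(mov_cost, 1))
```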

The energy cost of each kind of stall is experimentally determined through 
experiments that isolate the particular kind of stall. For example, an average 
cost of 250 per stall cycle was determined for the prefetch buffer stall. 

To account for the energy cost of the above stalls during program cost esti- 
mation, the number of stall cycles has to be multiplied by the experimentally 
determined stall energy cost. This product is then added to the base cost of 
the program. The number of stall cycles is estimated through a traversal of the 
program code. 

3.2.3 Effect of Cache Misses. Another inter-instruction effect is 
the effect of cache misses. The instruction timings listed in manuals give the 
cycle count assuming a cache hit. For a cache miss, a certain cycle penalty
has to be added to the instruction execution time. Along the same lines, the 
base costs for instructions with memory operands are determined in the context 
of cache hits. A cache miss will lead to extra cycles being consumed, which 
leads to an energy penalty. For experimentation purposes, a cache miss scenario 
is created by accessing memory addresses in an appropriate order. An average 
energy penalty of 216 per miss cycle has been experimentally obtained. This has 
to be multiplied by the average number of miss penalty cycles to get the average 
energy penalty for one miss. The average penalty multiplied by the cache miss 
rate is added to the base cost estimate to account for the cache misses during 
execution of a program. 
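The stall and cache miss corrections can be combined into one adjustment of the base cost. This is a sketch under stated assumptions: costs are kept in units of mA times cycles, misses are counted explicitly instead of being expressed as a miss rate, and the miss penalty of 3 cycles in the second call is a made-up value. The first call applies the 250 per stall cycle figure to the 120-instruction MOV DX,[BX] sequence mentioned in Section 3.2.2; the resulting average is derived here and is not a number reported in the text.

```python
STALL_COST_PER_CYCLE = 250.0  # prefetch buffer stall, from the text
MISS_COST_PER_CYCLE = 216.0   # cache miss penalty cycle, from the text


def program_cost(base_cost, stall_cycles=0, cache_misses=0,
                 miss_penalty_cycles=0):
    """Base cost plus stall and cache miss penalties (all in mA * cycles)."""
    stall_cost = stall_cycles * STALL_COST_PER_CYCLE
    miss_cost = cache_misses * miss_penalty_cycles * MISS_COST_PER_CYCLE
    return base_cost + stall_cost + miss_cost


# 120 MOV DX,[BX] instructions (428.3 mA, 1 cycle each) taking 164 cycles
mov_sequence_base = 120 * 428.3
mov_sequence_total = program_cost(mov_sequence_base, stall_cycles=164 - 120)
mov_sequence_avg = mov_sequence_total / 164

# Hypothetical program with two misses of three penalty cycles each
with_misses = program_cost(1000.0, cache_misses=2, miss_penalty_cycles=3)

print(round(mov_sequence_avg, 2), with_misses)
```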

4. Estimation Framework 



Program                        Base Cost (mA)   Cycles
----------------------------   --------------   ------
; Block B1
main:
mov bp,sp                      285.0            1
sub sp,4                       309.0            1
mov dx,0                       309.8            1
mov word ptr -4[bp],0          404.8            2
; Block B2
L2:
mov si,word ptr -4[bp]         433.4            1
add si,si                      309.0            1
add si,si                      309.0            1
mov bx,dx                      285.0            1
mov cx,word ptr _a[si]         433.4            1
add bx,cx                      309.0            1
mov si,word ptr _b[si]         433.4            1
add bx,si                      309.0            1
mov dx,bx                      285.0            1
mov di,word ptr -4[bp]         433.4            1
inc di,1                       297.0            1
mov word ptr -4[bp],di         560.1            1
cmp di,4                       313.1            1
jl L2                          405.7 (356.9)    3 (1)
; Block B3
L1:
mov word ptr _sum,dx           521.7            1
mov sp,bp                      285.0            1
jmp main                       403.8            3

Table 3. Illustration of the Estimation Process

In this section we describe a framework for energy estimation of programs
using the instruction level power model outlined in the previous section. We
start by illustrating this estimation process for the program shown in Table 3.
The program has three basic blocks as shown in the table (see Note 2). The
average current per cycle and the number of cycles for each instruction are
given along with each instruction. With these numbers the base cost of the
basic block B1 is 1713.4, B2 is 4709.8, and B3 is 2017.9. B1 is executed once,
B2 4 times and B3 once. The jmp main statement has been inserted to put the
program in an infinite loop. The cost of the jl L2 statement is not included in
the cost of B2 since its cost differs depending on whether the jump is taken
or not. It is taken 3 times and not taken once. Multiplying the cost of each
basic block by the number of times it is executed and adding the cost of the
conditional jump jl L2, we get a number proportional to the total energy cost
of the program. Dividing it by the
estimated number of cycles (72) gives us the cost per cycle (average current)
of 369.1. Adding the circuit state overhead offset value of 15.0 we get 384.0.
The actual measured average current is 385.0. This program does not have any
stalls and thus no further additions to the estimated cost are required. If in the
real execution of this program, some cold-start misses are expected, their energy
overhead will have to be added.
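
The worked example above can be replayed in a few lines. The sketch below takes the three block costs quoted in the text, the per-block cycle counts read off the table, and the two cases of jl L2; the dictionary layout and variable names are illustrative choices, not part of the original methodology. It reproduces the 72 estimated cycles and an average current of about 369.1 before the circuit state offset is added.

```python
# Block costs as quoted in the text; cycle counts are summed from the table.
# "runs" is the number of times each block executes.
blocks = {
    "B1": {"cost": 1713.4, "cycles": 5, "runs": 1},
    "B2": {"cost": 4709.8, "cycles": 13, "runs": 4},  # excludes jl L2
    "B3": {"cost": 2017.9, "cycles": 5, "runs": 1},
}
# jl L2 as (current per cycle, cycles, times executed): taken, then not taken.
jl_cases = [(405.7, 3, 3), (356.9, 1, 1)]
CIRCUIT_STATE_OVERHEAD = 15.0

total_cost = sum(b["cost"] * b["runs"] for b in blocks.values())
total_cycles = sum(b["cycles"] * b["runs"] for b in blocks.values())
for current, cycles, times in jl_cases:
    total_cost += current * cycles * times
    total_cycles += cycles * times

base_current = total_cost / total_cycles          # average current per cycle
estimate = base_current + CIRCUIT_STATE_OVERHEAD  # with circuit state offset
print(total_cycles, round(base_current, 1), round(estimate, 1))
```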

To validate the estimation model described in the previous section, experi- 
ments were conducted with several programs. A close correspondence between 
the estimated and measured cost was obtained. The estimated cost was typically 
within 3% of the measured cost. 

4.1 Overall Flow 

The overall flow of the estimation procedure is shown in Figure 2. Given an 
assembly or machine level program, it is first split up into basic blocks. The base 
cost of each instance of the basic block is determined by adding up the base costs 
of the instructions in the block. These costs are provided in a base cost table. 
The energy overhead due to pipeline, write buffer and other stalls is estimated 
for each basic block and added to the basic block cost. Next, the number of times 
each basic block is executed has to be determined. This depends on the path that 
the program follows and is dynamic, run-time information that is obtained from 
a program profiler. Given this information, each basic block is multiplied by the 
number of times it will be executed. The circuit-state overhead is added to the 
overall sum at this stage, or alternatively, it could have been determined for each 
basic block using a table of energy costs for pairs of instructions. An estimated 
cache penalty is added to get the final estimate. The cache penalty overhead 
computation needs an estimate of the miss ratio, which is obtained through a 
cache simulator. 
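
A minimal sketch of this flow is given below, under several assumptions that are not in the original: the `BasicBlock` record and the function name are hypothetical, the circuit-state overhead is applied as a constant offset on the average current (as in the worked example), and the cache simulator is assumed to report a total number of miss penalty cycles.

```python
from dataclasses import dataclass


@dataclass
class BasicBlock:
    base_cost: float   # sum of instruction base costs (current x cycles)
    cycles: int        # cycles for one execution, excluding stalls
    stall_cycles: int  # estimated stall cycles per execution
    exec_count: int    # execution count, e.g. from a profiler


def estimate_average_current(blocks, stall_cost, circuit_state_overhead,
                             miss_cost_per_cycle, miss_cycles):
    """Combine base costs, stalls, profile counts and cache penalty."""
    total_cost = 0.0
    total_cycles = 0
    for b in blocks:
        # Stall overhead is added to each block before weighting by count.
        per_run_cost = b.base_cost + b.stall_cycles * stall_cost
        per_run_cycles = b.cycles + b.stall_cycles
        total_cost += per_run_cost * b.exec_count
        total_cycles += per_run_cycles * b.exec_count
    # Cache penalty: assumed total miss cycles from a cache simulator.
    total_cost += miss_cost_per_cycle * miss_cycles
    total_cycles += miss_cycles
    return total_cost / total_cycles + circuit_state_overhead


# Hypothetical two-block program; 250 and 216 are the per-cycle stall and
# miss costs quoted in the text, all other numbers are made up.
demo_blocks = [BasicBlock(1000.0, 4, 0, 10), BasicBlock(600.0, 2, 1, 5)]
result = estimate_average_current(demo_blocks, 250.0, 15.0, 216.0, 10)
print(round(result, 2))
```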

Figure 2. Software Energy Consumption Estimation Methodology



5. Memory System Modeling 

The energy consumption in the memory system is also a function of the soft- 
ware being executed. The salient observations regarding the DRAM system cur- 
rent on our experimental setup are briefly described here. Details are provided 
in a separate reference [6]. 

The DRAM system draws constant current when no memory access is taking 
place. This current value was determined to be 11.0mA or 5.3mA, depending on 
whether page mode was active or not. Greater current is drawn during a memory 
access. The exact value of this current depends on the address of the present and 
previous memory access. For example, for writes, the cost of a page hit is 122.8 
(for 3 cycles) and that of a miss is 247.8 (for 6 cycles). For page hits, a smaller 
variation was observed depending on the number of bits that change from the
previous address to the present. 

Let X be the sum of the energy costs of each individual memory access. Let 
n and m be the number of memory idle cycles during which the page mode is 
active and inactive, respectively. The total memory system energy cost is given 
by X + 11.0 × n + 5.3 × m. As discussed above, the quantity X depends on the
location and sequence of memory accesses made by the program. Along with 
n and m, this is dynamic, run-time information, which can only be loosely esti- 
mated by static analysis. Thus, modeling of memory system energy consump- 
tion is difficult if only static analysis is used. However, as the above discussion 
shows, analysis of this consumption is feasible. This is significant, given that 
for systems with tight energy budgets, it is important to understand all sources 
of energy consumption. 
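
As an illustration of the memory system cost expression, the sketch below uses the idle currents of 11.0 (page mode active) and 5.3 (page mode inactive) and the write costs quoted earlier in this section. The function name and the access trace are hypothetical, and in practice X, n and m would come from a run-time trace rather than from constants.

```python
# Idle currents and write costs are the values quoted in the text.
IDLE_COST_PAGE_MODE = 11.0     # per idle cycle with page mode active
IDLE_COST_NO_PAGE_MODE = 5.3   # per idle cycle with page mode inactive
WRITE_PAGE_HIT = 122.8         # cost of a write that hits the open page
WRITE_PAGE_MISS = 247.8        # cost of a write that misses the open page


def memory_energy(access_costs, n_idle_page_mode, m_idle_no_page_mode):
    """Total memory system cost: X + 11.0 * n + 5.3 * m."""
    x = sum(access_costs)  # X: sum of the individual access costs
    return (x
            + IDLE_COST_PAGE_MODE * n_idle_page_mode
            + IDLE_COST_NO_PAGE_MODE * m_idle_no_page_mode)


# Hypothetical trace: three write page hits, one write page miss,
# 100 idle cycles in page mode and 50 idle cycles outside it.
trace = [WRITE_PAGE_HIT] * 3 + [WRITE_PAGE_MISS]
total = memory_energy(trace, 100, 50)
print(round(total, 1))
```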

6. Software Power Optimization 

In recent years, there has been a spurt of research activity targeted at reducing 
the energy consumption in systems. This research, however, has by and large 
not recognized the potential energy savings achievable through optimization of 
software. This was mainly due to the lack of practical techniques for analyzing 
the energy consumption of programs. This deficiency has been alleviated by the 
measurement and estimation methodology described in the previous sections. 
This methodology makes it possible to compare and evaluate programs in terms 
of their energy consumption and also to study the effect of compilation on the 
energy consumption of programs. 

Using the results of this work, several possible avenues for energy reduction 
through code restructuring and compilation have been studied [5]. Examples with
energy reduction of up to 40% on the 486DX2 based system, obtained by rewrit- 
ing code, demonstrate the potential of these ideas. These ideas will be pursued 
further as part of the research in the area of software power optimization. 

7. Analysis of SPARClite 934 

The previous sections describe the application of the power analysis method- 
ology for the 486DX2, a CISC processor. To verify the general applicability of 
this methodology, it was decided to apply the methodology to a processor with a 
different architectural style. The Fujitsu SPARClite 934, a RISC processor tar- 
geted for embedded applications was chosen for this purpose. A power analysis 
of this processor has been performed using the measurement and experimenta- 
tion techniques described in the previous sections. The basic model of a base 
energy cost per instruction, enhanced by the inter-instruction effects remains 
valid for this processor, though the actual costs differ in value. The details of 
this analysis are described in a separate reference [4]. 

8. Summary and Future Work

This paper presents a methodology for analyzing the energy consumption of 
embedded software. It is based on an instruction level model that quantifies the 
energy cost of individual instructions and of the various inter-instruction effects. 
The motivation for the analysis methodology is three-fold. It provides insights 
into the energy consumption in processors. It can be used to help verify if an 
embedded design meets its energy constraints and it can also be used to guide 
the design of embedded software such that it meets these constraints. 

The methodology has so far been applied to two commercial processors, a 
CISC and a RISC. Future work will extend this to other architecture styles, to 
characterize and contrast their energy consumption models. DSPs, superscalar 
processors and processors with internal power management will be considered. 

Notes

1. All instructions are executed in “Real Mode”. All registers contain 0, except in entry 11, where CL 
contains 1. Entry 15 is a “taken” jump while entry 16 is “fall through”. Entries 5, 6 and 9 show normalized 
costs [6]. 

2. A basic block is defined as a contiguous section of code with exactly one entry and exit point. 

Abstract

In Deep Sub-Micron (DSM) designs, performance will depend critically on the latency of long 
wires. We propose a new synthesis methodology for synchronous systems that makes the de- 
sign functionally insensitive to the latency of long wires. Given a synchronous specification of 
a design, we generate a functionally equivalent synchronous implementation that can tolerate
arbitrary communication latency between latches. By using latches we can break a long wire into
short segments which can be traversed while meeting a single clock cycle constraint. The overall
goal is to obtain a design that is robust with respect to delays of long wires, in a shorter time 
by reducing the multiple iterations between logical and physical design, and with performance 
that is optimized with respect to the speed of the single components of the design. In this paper 
we describe the details of the proposed methodology as well as report on the latency insensitive 
design of PDLX, an out-of-order microprocessor with speculative-execution. 



1. Introduction 

The advent of deep sub-micron (DSM) process technologies, 0.18µ and
below, has generated a flurry of predictions on the effects of the inevitable
dominance of wire delays on chip design. Although there is a certain amount of
disagreement between the various studies on interconnect latencies in future 
design generations [9, 10], there is unanimity that the delay of a “long” wire 
will play a dominant role in logic synthesis and optimization. Recent
advances in interconnect optimization techniques (such as interconnect topology
optimization, optimal buffer insertion and sizing, optimal wire-sizing) can help
to reduce interconnect delays significantly [8], but they are not able to reverse
the trend of a growing gap between device and interconnect performance [7]. In
the current standard-cell design methodology, logic synthesis is performed
using delay estimates for library modules that are parameterized to account for
loading factors and transition (or slew) rates. As the delay of long wires
becomes larger relative to gate delays, these estimates become increasingly
sensitive to layout. Attempts have already been made to account for layout effects
by performing floor-planning and wire-planning on register-transfer level (RTL)
descriptions [25]. Such an approach requires extreme precaution in deriving
constraints for synthesis tools, since any wire whose delay approaches a single
clock period may cause a failure to meet the timing constraints.

In this paper, we propose an alternative synthesis methodology that pro- 
duces designs functionally insensitive to the latency of long wires. Given a 
synchronous design consisting of several communicating modules, automatic 
synthesis techniques are used to generate a functionally equivalent synchronous 
implementation that can tolerate arbitrary communication latency between
modules. The overall goal is to achieve a robust design implementation that has as
high a throughput as possible. As a preliminary assumption, each module must
satisfy the stallability property, meaning that it can be stalled for an arbitrary
number of clock cycles without losing its internal state. In our implementation,
the modules of the design communicate over channels, using a standard protocol
that is insensitive to latency. This protocol allows a channel to run a number of
clock cycles ahead of or behind other channels. The resulting system is
guaranteed by construction to be functionally equivalent to the original
synchronous specification. The system maintains
the appearance of a fully synchronous system despite the non-uniform latencies
along communication channels of the actual implementation.

The methodology is presented in Section 2 and discussed with respect to pre- 
vious work in Section 3. In Section 4, we summarize the theory of latency 
insensitive protocols. In Section 5, we address some issues related to latency 
insensitive protocol implementation. In Section 6 we report on performance 
evaluation of the latency-insensitive design methodology for a fairly complex 
prototype system. 

2. The Methodology 

The proposed methodology is based on the automatic synthesis of a communi- 
cation architecture implementing a latency insensitive communication protocol. 
It consists of a succession of five basic steps:

1 The designer starts with a completely synchronous specification of the 
system and with a collection of modules, which can be either acquired as 
intellectual property (IP) cores from an (internal or external) third party
or can be specified as “synthesizable” code using a hardware description 
language such as Verilog or VHDL. 

2 Communicating modules are connected by means of channels as illus- 
trated in Figure 1. Each channel operates using a latency-insensitive 
communication protocol and is made up of wires and logic blocks called 
relay stations. The wires of a channel are laid out together and share 
physical characteristics. The relay stations consist of latches together 
with logic gates implementing the functionality related to the latency- 
insensitive communication protocol. 

3 Each module is encapsulated within a logic block called shell, playing the 
role of interface towards the communication architecture. 

4 The layout is obtained using standard place & route tools. 

5 A post-layout optimization step is performed to insert the necessary num- 
ber of relay stations into each “critical channel” to ensure that the cycle 
time is met (channel segmentation). Some iterations may be required, but 
they are limited to each channel separately, while logic and layout of all 
modules remain untouched. 
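
Channel segmentation in step 5 amounts to a small calculation per critical channel. The sketch below is only an illustration: the function name is hypothetical, and it assumes that the post-layout wire delay can be split evenly among the segments and that the delay added by a relay station itself is negligible.

```python
import math


def relay_stations_needed(wire_delay_ns, clock_period_ns):
    """Relay stations required so that every wire segment fits in one cycle."""
    # Assumption: the wire delay divides evenly across the segments.
    segments = max(1, math.ceil(wire_delay_ns / clock_period_ns))
    # k relay stations split a channel into k + 1 segments.
    return segments - 1


# Hypothetical channels with a 10 ns clock period.
for delay in (8.0, 10.0, 35.0):
    print(delay, relay_stations_needed(delay, 10.0))
```

A channel whose wire delay already fits within the clock period needs no relay station, while a 35 ns wire under a 10 ns clock is cut into 4 segments by 3 relay stations.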

The essential point in this methodology is the orthogonalization of concerns 
between behavior and communication. Since the communication mechanism is 
automatically synthesized (as described later in this paper both relay stations 
and shells can be built with no intervention of the designer based only on the 
theory of latency insensitive protocols), the designer can focus on the choice 
of the modules that make up the functionality of the implementation without 
worrying about synchronization and latency of the overall design. 




Figure 1. Shell Encapsulation and Communication Channels. 
Communication design does not have any impact on the design and
implementation of the modules provided that the modules and the relay stations share
a fundamental property, patience (see Section 4). Requiring that an arbitrary
module be patient at the outset is quite a strong requirement. This is the
reason why we encapsulate the modules with an appropriate shell that has the
task of making the module look patient. Such shells can be automatically
generated for all modules
if the output of the module is latched and each module is stallable [4].
“Stallability” means that a module can stall for an arbitrary number of clock cycles
without losing its internal state (and the overall state of the system) and is much
weaker than patience.

3. Related Work 

The adoption of DSM process technologies and the increasing impact of
interconnect delay are destined to exacerbate the timing-closure problem: the
designer is forced to iterate many times between synthesis and layout, because
the two steps are performed independently and synthesis uses statistical delay 
models which badly estimate the post-layout wire load capacitance [7, 19]. 

In [10] Sylvester and Keutzer discuss the impact of DSM geometries on the 
future of design automation methodologies and envision that future integrated 
circuits will be implemented hierarchically with large macro-blocks of
approximately 50K to 100K gates. They conclude that traditional standard-cell design
flow will be still used for the design of such macro-blocks, because "intercon- 
nect delay will be small (< 25%) in block of 50K gates". These results are 
obtained from the analysis of detailed ASIC design data, such as average wire- 
lengths and average net fan-out. However, one must observe that the timing 
closure problem arises when the delay of the critical path in the design is exces- 
sive, and, therefore, it is by nature a worst-case problem and not an average-case 
problem. Most of the solutions proposed in literature so far call for tighter inter- 
action between synthesis and physical design. A synthesis-driven methodology 
that optimizes for interconnect delay rather than gate delay during logic synthe- 
sis is presented in [14]. Unfortunately, the approach produces a large amount 
of logic duplication, which may lead to expensive area overheads. Floorplan- 
ning, technology mapping and gate placement are combined in [27], where, 
after placement has been completed, the critical paths are reduced one at a time 
to meet the timing requirements. Since fixing one critical path may generate new
ones, this approach is unable to solve the convergence problem by construction.
A series of layout-driven approaches suggest fixing the layout and extracting
accurate physical information, which is used to guide different types of logic
optimization, such as gate-resizing [16], fanout optimization [18], buffer
insertion [28], and logic resynthesis [22].

All these approaches represent remedies to the effects of bad estimations
made during logic synthesis and do not seem able to scale well with the shrink- 
ing of process geometries. Following the old adage that an ounce of prevention 
is worth a pound of cure, we believe that the time for a radical paradigm shift is 
approaching. 

3.1 Latency Insensitive vs. Asynchronous Design 

The latency insensitive design methodology is clearly reminiscent of many 
ideas which have been proposed in the asynchronous design community
during the past three decades [11]. In particular, the idea of a design
methodology which is inherently modular is already present in the work on
Macromodular Computer Systems by Clark and Molnar [5, 6]. To separate the
design of these modules from the design of the system and make the entire process
amenable to automation, the modules must be implemented as delay-insensitive 
circuits [24, 26]. A delay-insensitive circuit is designed to operate correctly 
regardless of the delays on its gates and wires (unbounded delay model) [32]. 
However, it has been proven that almost no useful delay-insensitive circuits can 
be built if one is restricted to a class of simple logic gates [2, 23]. To be able to 
build complex systems one must use more complex components, which are “ex- 
ternally” delay insensitive, while “internally” are designed by carefully verify- 
ing their timing and avoiding or tolerating metastability [13, 17, 26]. By slightly 
relaxing the unbounded delay model and allowing “isochronic forks” prac- 
tical quasi-delay-insensitive circuits can be built using simple logic gates [3]. 
A further relaxation leads to speed independent circuits, which operate cor- 
rectly regardless of gate delays, while wire delays are assumed to be negligi- 
ble [1, 12, 20]. Both quasi-delay-insensitive and speed-independent circuits as- 
sume that the designer is able to control wire delays, and, therefore, do not ap- 
pear as interesting alternatives when moving to DSM implementations. Instead, 
a methodology based on assembling complex modules which are “externally” 
delay-insensitive seems the right solution, on condition that the synthesis of such 
modules is not too cumbersome. However, it must be noted that asynchronous 
approaches do not address the fundamental problem of latency, because an asyn- 
chronous design simply slows down to accommodate the slowest component, 
e.g. the wires. 

While a delay insensitive system is based on the assumption that the delay 
between two subsequent events on a communication channel is completely arbi- 
trary, in the case of a latency insensitive system this arbitrary delay is a multiple 
of the clock period. The key point is that this kind of discretization allows us 
to leverage well-accepted design methodologies for the design and validation 
of synchronous circuits. In fact, the basic distinction between any of the
previous asynchronous design methodologies and the latency-insensitive one is
essentially that a latency insensitive system is specified as a synchronous system.
Notice that we say “specified” because, from an implementation point of view,
a latency-insensitive communication protocol can also be realized using
handshaking signaling techniques (such as request/acknowledge protocols), which
are typically asynchronous. However, from a specification point of view,
each module (as well as the overall system) is viewed as a synchronous sys- 
tem. Now, to specify a complex system as a collection of modules whose state 
is updated collectively in one “zero-time” step is naturally simpler than spec- 
ifying the same system as the interaction of many components whose state is 
updated following an intricate set of interdependency relations. Furthermore, 
the synchronous specification allows us to slightly modify the traditional
semi-custom design methodology, by simply inserting a step to encapsulate each
synchronous module within a shell. Finally, the impact is very different also from a
validation point of view because simulation is naturally a less complex task for
a synchronous circuit than for an equivalent asynchronous one. In conclusion, the
proposed methodology can be implemented on top of the commonly-adopted
standard-cell design flow, while all previous asynchronous approaches force the
designer to use new tools and, more importantly, to think about the digital
system in a completely different way.

4. Latency Insensitive Protocols 

The proposed design methodology is based on the theory of latency insensi- 
tive protocols, which has been recently presented in literature [4]. This theory 
can be summarized as follows. A latency insensitive protocol is a communica- 
tion protocol governing the exchange of information in a patient system. Ac- 
cording to the Tagged-Signal Model [21] a system is a composition of processes 
communicating by exchanging signals, i.e. sequences of events, on a set of 
channels. A behavior of a system is unambiguously described by the set of 
signals which are exchanged among its processes. A patient system is a
synchronous system whose functionality only depends on the order of the events of
each signal and not on their exact timing. More specifically, a patient system
is a collection of patient processes communicating by means of “point-to-point”
channels whose latency may be arbitrary. Normally, at every cycle $t_k$, a generic
patient process $P_j$ receives a new informative event on each of its input channels
and it emits informative events, which are the result of its internal computation
up to the previous cycle $t_{k-1}$, on its output channels. However, due to
arbitrary channel latencies, it may happen that at cycle $t_k$ a stalling event
(denoting the absence of an informative event) arrives on one or more of its
input channels. If this is the case, process $P_j$ (being patient) waits an
arbitrary but finite number of extra cycles until all informative events (which
were expected at $t_k$) have arrived
on all input channels. During this wait, $P_j$ emits stalling events. Any sequence
of stalling cycles does not affect the internal state of $P_j$ (the process is
patient) or the overall state of the system (the protocol guarantees that all
processes awaiting data from $P_j$ receive instead a stalling event).

If all channels in the system have unit latency then no stalling events are
exchanged among its processes. Let $S_{ref}$ be a patient system with such a
characteristic. Then, let $S_{stall}$ be another patient system which is composed
of exactly the same processes as $S_{ref}$, while having some channels with
latency greater than one clock cycle. Now, assume that the same external
stimulus is applied to the two systems, yielding two corresponding behaviors
$\beta_{ref}$ and $\beta_{stall}$. If all stalling events are filtered away from
$\beta_{stall}$, the resulting behavior is exactly equal to $\beta_{ref}$. The
two behaviors are said to be latency equivalent. Further, if every behavior of
$S_{ref}$ is latency equivalent to some behavior of $S_{stall}$ (and vice versa) then
the two processes are said to be latency equivalent. It has been proven that, for
patient processes, latency equivalence is compositional [4].
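
Latency equivalence of two behaviors can be illustrated on finite prefixes of signals. In the sketch below the `TAU` marker, the function names and the sample signals are all hypothetical; real signals in the theory are unbounded sequences, so this only checks the prefixes supplied.

```python
TAU = "tau"  # hypothetical marker for a stalling event


def informative(signal):
    """Drop stalling events, keeping informative events in order."""
    return [event for event in signal if event != TAU]


def latency_equivalent(signal_a, signal_b):
    """Two signals are latency equivalent if they match once stalls are removed."""
    return informative(signal_a) == informative(signal_b)


ref = ["a", "b", "c", "d"]                        # unit-latency behavior
stalled = ["a", TAU, "b", "c", TAU, TAU, "d"]     # same events, with stalls
reordered = ["a", TAU, "c", "b", "d"]             # order of events changed

print(latency_equivalent(ref, stalled))    # stalls only: equivalent
print(latency_equivalent(ref, reordered))  # reordering: not equivalent
```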

A relay station is a patient process communicating with two channels $c_i$
and $c_o$ such that if $s_i$ and $s_o$ are the signals associated to the channels and
$I(l,k,s_i)$, $l \le k$, denotes the sequence of informative events of $s_i$ between the
$l$-th clock cycle and the $k$-th one, then $s_i$ and $s_o$ are latency equivalent and for all $k$

$$|I(1,k-1,s_i)| - |I(1,k,s_o)| \ge 0 \tag{1}$$

$$|I(1,k,s_i)| - |I(1,k-1,s_o)| \le 2 \tag{2}$$

where $|\cdot|$ denotes the number of events in a sequence.

The following is an example of relay station behavior, where $\tau$ denotes a stalling
event and $\iota_j$ a generic informative event:

$$s_i = \iota_1\ \iota_2\ \iota_3\ \tau\ \tau\ \iota_4\ \iota_5\ \iota_6\ \tau\ \tau\ \tau\ \iota_7\ \tau\ \iota_8\ \iota_9\ \iota_{10}\ \ldots$$

$$s_o = \tau\ \iota_1\ \iota_2\ \iota_3\ \tau\ \tau\ \iota_4\ \tau\ \tau\ \tau\ \iota_5\ \iota_6\ \iota_7\ \tau\ \iota_8\ \iota_9\ \iota_{10}\ \ldots$$

Notice that no further specification has been given on the signals $s_i$ and $s_o$ (for
instance saying that $s_i$ is the input and $s_o$ is the output). The definition of
relay station simply involves a set of relations, i.e. a protocol, between $s_i$ and $s_o$
without any implementation detail. Still, it is clear that each informative event
received on channel $c_i$ is later emitted on $c_o$, while the presence of a stalling
event on $c_o$ may induce a stalling event on $c_i$ in a later cycle. In fact, an
informative event takes at least one clock cycle to pass through a relay station (minimum
forward latency = 1), at most two informative events can arrive on $c_i$ while no
informative events are emitted on $c_o$ (internal storage capacity = 2), and, finally,
one extra stalling event on $c_o$ will “move” into $c_i$ in at least one cycle
(minimum backward latency = 1). The double storage capacity of a relay station
permits, in the best case, communication with maximum throughput (equal to
one): a practical confirmation of this fact is given in Section 5, where an RTL
implementation of a relay station is discussed.
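
Inequalities (1) and (2) can be checked mechanically on a finite prefix of the two signals. The sketch below encodes the example behavior given above, using `None` for a stalling event and the integer j for the j-th informative event; this encoding and the function names are illustrative assumptions.

```python
TAU = None  # hypothetical encoding of a stalling event


def count_informative(signal, k):
    """Number of informative events in the first k clock cycles."""
    return sum(1 for event in signal[:k] if event is not TAU)


def satisfies_relay_constraints(s_in, s_out):
    """Check inequalities (1) and (2) for every cycle k of the common prefix."""
    n = min(len(s_in), len(s_out))
    for k in range(1, n + 1):
        # (1): nothing is emitted before it has been received.
        if count_informative(s_in, k - 1) - count_informative(s_out, k) < 0:
            return False
        # (2): at most two received events are waiting inside the station.
        if count_informative(s_in, k) - count_informative(s_out, k - 1) > 2:
            return False
    return True


# The example signals from the text.
s_i = [1, 2, 3, TAU, TAU, 4, 5, 6, TAU, TAU, TAU, 7, TAU, 8, 9, 10]
s_o = [TAU, 1, 2, 3, TAU, TAU, 4, TAU, TAU, TAU, 5, 6, 7, TAU, 8, 9, 10]
print(satisfies_relay_constraints(s_i, s_o))

# A pair that violates (1): the output emits an event one cycle too early.
print(satisfies_relay_constraints([TAU, 1], [1, TAU]))
```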

Since relay stations are patient processes, their insertion in a patient system
guarantees that the system remains patient. Further, since they have minimum
latencies equal to one, they can be repeatedly inserted on a channel to increase
its latency. Therefore, the methodology is patterned after the theory as follows:
(1) we start by giving an abstract specification of a digital system as a collection
of synchronous modules without making any assumption on the latency of the 
wires (which are grouped in channels), then (2) we automatically synthesize 
a corresponding layout, (3) we segment every wire whose latency is greater 
than the desired clock period by distributing on it the necessary amount of relay 
stations, and (4) we build the shell around the modules to obtain patient pro- 
cesses that interact with the appropriate relay stations. Obviously, the final result 
will be satisfactory only to the extent that a sufficient throughput can be main- 
tained in the presence of increased latency of wires. However, this is a general 
problem that will have to be faced in the design of large chips with DSM tech- 
nologies, and not specific to the latency insensitive methodology. On the other 
hand, the latency insensitive methodology allows an easy early exploration of 
latency/throughput tradeoffs as illustrated in Section 6. 

5. The Implementation of the Protocol 

In this Section, we present a latency insensitive communication architecture 
consisting of channels, relay stations, and shells built according to our method- 
ology. 



5.1 Channels 

Channels are point-to-point unidirectional links between a source module and 
a sink module. Data are transmitted on a channel by means of packets; a packet
consists of a variable number of fields. Here, we consider only two basic fields: 
payload contains the transmitted data and void is a one bit flag which, if set 
to 1, denotes that no data are present in the packet (void packet). If a packet 
does contain “meaningful” payload data (i.e., void is set to 0) we will call it 
a true packet. A channel is made of wires and relay stations. The number of 
relay stations in a channel is finite and represents the buffering capability of the 
channel. 
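
The two packet fields can be pictured with a small record type. The sketch below is an illustration only: the `Packet` class, the helper names and the integer payload type are assumptions, not part of the described implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Packet:
    payload: Optional[int] = None  # transmitted data (ignored when void)
    void: bool = True              # True means no data present in the packet


VOID_PACKET = Packet()


def true_packet(data):
    """Build a packet that carries meaningful payload data."""
    return Packet(payload=data, void=False)


def sink_receive(stream):
    """Keep the payload of true packets and discard void packets."""
    return [p.payload for p in stream if not p.void]


stream = [true_packet(0x44), VOID_PACKET, true_packet(0x45)]
print(sink_receive(stream))
```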

At each clock cycle, the source module may either put a new true packet on 
the channel or, in case no output data are available to be sent, put a void packet 
on it; on the other side, at each clock cycle the sink module retrieves from the 
channel the incoming packet and, on the basis of the void field value, decides 
whether to discard it or to store it on its input channel queue for later use. Just as a
source module might not be ready to send a true packet, a sink module might
not be ready to receive one, for instance because its input queue is full. However,
the latency insensitive protocol demands fully reliable communication among
the modules, where no lossy communication link is allowed and all packets are
properly delivered. Consequently, the sink module must have a way to interact
with the channel (and ultimately with the corresponding source module) to stop
the communication flow momentarily and avoid the loss of any packet.
Therefore, we slightly relax our definition of a channel as unidirectional, to allow a
bit of information, called the channel stop flag, to move in the opposite
direction. By setting the stop flag equal to one during a certain clock cycle, the sink
module informs the channel that the next packet cannot be received and that it must
be held until the stop flag is reset. Like the sink module, the channel also has a
limited amount of buffering resources: a channel dealing with a sink module
that requires a long stall period may fill up all its relay stations and be forced
to send a stop flag to the source module so that the latter will put its packet
production on stall.

5.2 Relay Stations 

Figure 2 illustrates a possible relay station implementation based on the fol- 
lowing specification, which refines the abstract notion given in Section 4: 

“At each clock cycle $t$ it takes a packet $packetIn^{t}$ and a stop signal $stopIn^{t}$
as inputs and it emits a packet $packetOut^{t+1}$ and a stop signal $stopOut^{t+1}$ as
outputs: $stopOut^{t+1}$ is always equal to $stopIn^{t}$, while, according to the value
of the internal variable $stalling = stopIn^{t} \wedge stopIn^{t-1}$, the relay station decides
whether to set $packetOut^{t+1}$ equal to $packetIn^{t}$ (case: $stalling = 0$) or to stall
by keeping $packetOut^{t+1}$ equal to $packetOut^{t}$ and saving the value of $packetIn^{t}$ into
an auxiliary register (case: $stalling = 1$)”.
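
The quoted rule can be exercised with a small behavioral model. The following is a minimal Python sketch, not the Verilog RTL used by the authors: the class name `RelayStation`, its field names, and the helper `count_stalls` are assumptions made for illustration. It implements only what the specification states; how the auxiliary register is drained once the stall ends is not covered by the quoted text and is deliberately left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class RelayStation:
    """Hypothetical model of one relay station, following only the quoted rule."""
    packet_out: Optional[int] = None   # value driven on packetOut
    stop_out: bool = False             # value driven on stopOut
    prev_stop_in: bool = False         # stopIn observed in the previous cycle
    aux: Optional[int] = None          # auxiliary register

    def step(self, packet_in: int, stop_in: bool) -> bool:
        """Advance one clock cycle; return True if the station stalled."""
        stalling = stop_in and self.prev_stop_in
        self.stop_out = stop_in            # stopOut(t+1) = stopIn(t)
        if stalling:
            self.aux = packet_in           # hold packetOut, park the input
        else:
            self.packet_out = packet_in    # packetOut(t+1) = packetIn(t)
        self.prev_stop_in = stop_in
        return stalling


def count_stalls(stops: Iterable[bool]) -> int:
    """Count stalled cycles for a given stopIn sequence (packets are dummies)."""
    station = RelayStation()
    return sum(station.step(packet_in=i, stop_in=s) for i, s in enumerate(stops))
```

Under this model a stop signal held high for three cycles stalls the station for two cycles, and a stop signal held high for a single cycle causes no stall.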




Figure 2. Relay Station Implementation. 



Figure 3 illustrates two modules, Fetch Unit and Instruction Cache, communicating
using two channels, Address Channel and Data Channel. Both channels
have been partitioned into 4 segments by the insertion of 3 relay stations and, as
a consequence, the lower bound on the latency of each channel has become 4
clock cycles. Figure 4 reports a snapshot of the waveforms obtained by simulating
a Verilog RTL description of the Address Channel: here, the source
module is the Fetch Unit producing a sequence of addresses for a Memory Block
which represents the sink module. The addresses are reported as hexadecimal
numbers.



Figure 3. Channels between Fetch Unit and Instruction Cache.

Besides the system clock, which has period $T_{clk}$ equal to 10 ns, one can see 8
waveforms which, going from top to bottom, correspond respectively to the following
signals of Figure 3: R2.packetOut, R2.stopIn, R1.packetOut, R1.stopIn,
R0.packetOut, R0.stopIn, FU.packetOut, FU.stopIn.

At time t = 75 ns the sink module sets R2.stopIn equal to one and keeps
it equal to one for three clock cycles. As a consequence, R2 stalls for two cycles:
it maintains R2.packetOut = h'44 for the next three cycles while storing
R1.packetOut = h'45 in an auxiliary set of registers. In the meantime, the
stop signal is propagated to R1.stopIn. When, after three clock cycles, at time
t = 105 ns, the sink module can finally receive R2.packetOut = h'44, it resets
R2.stopIn so that at the following clock cycle R2 may set R2.packetOut =
h'45. In the meantime, the three consecutive high values of the stop signal propagate
back through the channel, provoking a stall of two cycles for each station
while guaranteeing that no packets are lost. Notice that a characteristic of this
implementation of the protocol is that when a stopIn signal is kept high for only
one cycle, the relay station does not really stall: in Figure 4 this can be observed
for the sequence of clock cycles starting at t = 165 ns. This is simply a positive
by-product of the fact that the storing capacity of a relay station is double (see note 4).

5.3 Shells 

As introduced in Section 2, given a particular module M, an instance of a shell 
can be automatically synthesized as a wrapper to encapsulate M and interface 
it with the channels so that M becomes a patient process. To do so the only 
necessary condition is that M be stallable. 

At each clock cycle the internal computation of the module must be fired only if all
inputs have arrived. Guaranteeing this input synchronization is the first task of
the shell of a module. The second task is called output propagation: at each
clock cycle, if module M has produced new output values and no output channel
has previously raised a stop flag, then these output values can be transmitted,
generating new true packets; if either of these two conditions does not hold, then
the packet transmitted in the previous cycle is re-transmitted as a void packet.



Figure 4. Waveforms on Address Channel.

In summary, a shell for module M cyclically performs the following actions:

1 it gets the incoming packets from the input channels, filters away the void 
packets and extracts the input values for M from the payload fields of the 
true packets; 

2 when all input values are available for the next computation, it passes 
them to M and fires the computation; 

3 it gets the results of the computation from M; 

4 if no output channel has previously raised a stop flag, it routes the result 
into the output channels. 
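
The four actions above can be sketched as a small Python wrapper. This is an illustrative model under stated assumptions, not the authors' shell library: the names `Packet` and `Shell` are hypothetical, the module M is modeled as a plain function with a single output channel, the input queues are unbounded, and the shell fires M only when no stop flag is raised (relying on M being stallable so that no result is lost).

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass(frozen=True)
class Packet:
    payload: Any
    void: bool = False


class Shell:
    """Hypothetical shell that makes a function-like module M patient."""

    def __init__(self, module: Callable[..., Any], n_inputs: int) -> None:
        self.module = module
        self.queues = [deque() for _ in range(n_inputs)]
        self.last_out = Packet(None, void=True)

    def step(self, incoming: Sequence[Packet], stop_raised: bool) -> Packet:
        # 1. Filter away void packets and keep the payloads of true packets.
        for queue, packet in zip(self.queues, incoming):
            if not packet.void:
                queue.append(packet.payload)
        # 2.-4. Fire M only if every input is available and no stop is raised.
        if all(self.queues) and not stop_raised:
            args = [queue.popleft() for queue in self.queues]
            self.last_out = Packet(self.module(*args))
        else:
            # Otherwise re-transmit the previous packet, marked as void.
            self.last_out = Packet(self.last_out.payload, void=True)
        return self.last_out
```

For instance, wrapping a two-input adder, the shell emits void packets until one value has arrived on each input channel, and it holds queued inputs while the output channel keeps its stop flag raised.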

6. Case Study: The PDLX Microprocessor

To test our methodology, we performed a “latency insensitive design” of an
out-of-order microprocessor (PDLX) with speculative execution. Its instruction
set is the same as that of the DLX microprocessor described in [15]. Its architecture is
based on an extended version of Tomasulo’s algorithm [31], which combines
traditional dynamic scheduling with hardware-based speculative execution. The
data-path of PDLX is similar to that of some of the most advanced microprocessors
available on the market today.

Figure 5 illustrates a simplified block diagram of the PDLX architecture: the 
PC Unit sends the current value of the Program Counter (PC) to the Instruction 
Cache and the Fetch Unit. After receiving the corresponding instruction, the 
Fetch Unit couples it with the PC value and sends it to the Decode Unit. Once 
instruction decoding is completed, the result arrives at the Execution Unit, which
performs the execution phase working with the Data Cache and the Register
File. If the result of the execution is a “branch taken”, then the branch target
address is sent to the PC Unit.




Figure 5. PDLX Microprocessor Block Diagram : top level view. 



In our implementation, the 7 units correspond to 7 modules made patient by
adding an appropriate shell. Obviously, this decomposition of the hardware
implementing the PDLX is not the only possible one, let alone the best one. Still,
while reasonably simple, it presents interesting challenges to the realization of
the proposed latency insensitive communication architecture. In particular, the
Fetch Unit shell merges two separate channels (which are likely to have different
latencies), and each time a “branch taken” is executed a “feedback path” is activated
between the Execution Unit and the PC Unit.

We performed a high-level cycle-accurate design of PDLX by using BONeS
DESIGNER [29]. We first designed the PDLX modules illustrated in Figure 5,
keeping in mind only the following informal rule to make the process stallable:
at each clock cycle the execution process of a module can always be frozen
without affecting its internal state. Then, we designed the latency insensitive
protocol library, containing relay stations and shells as building blocks. Finally,
we encapsulated each module in a shell and we obtained the final system. To
test our design, we took some simple numerical C programs (permutations,
binary search, ...) and we generated the corresponding DLX assembler code by
using DLXCC, a publicly available DLX compiler [30]. Then, we loaded the
assembler code into the PDLX Instruction Cache and we executed it, while logging
every read/write access to the Data Cache. Finally, we compared the “log file”
with the one obtained by executing the same assembler code on the DLX simulator
DLXSIM to verify that the functional behavior was indeed the same.




Figure 6. Effective Throughput vs. PDLX implementations. 



For each program execution, we computed the total number of clock cycles $T$
necessary to complete the execution of the assembler code: this number is equal
to $I + S + P$, where $I$ is the number of instructions which have been committed, $S$
is the number of cycles lost due to a stall within the execution unit, and $P$ is the
number of cycles lost due to pipeline latency. Since the PDLX is a single-issue
microprocessor, the instruction throughput $I/T$ is a quantity less than or
equal to one. This quantity can be multiplied by the system clock frequency to
obtain the effective instruction throughput $ET = (I/T) \cdot f_{CLK}$, which allows us
to compare the execution of the same assembler code on different PDLX implementations
running at different speeds. Figure 6 illustrates the results obtained
running three different assembler programs: the effective instruction throughput
is reported on the y-axis, while each discrete point on the x-axis corresponds to
a different PDLX implementation with a different fixed amount of latency on
some channels. We focused on two specific channels in Figure 5: channel $C_a$
between the Execution Unit and the Data Cache and channel $C_b$ between the
Fetch Unit and the Instruction Cache. We assumed that the wires grouped in
these two channels represent the critical path of the PDLX design and that, after
segmenting them (by inserting relay stations), we could afford to raise the clock
frequency appropriately. We varied the latency on the two channels as follows:
going from left to right on the x-axis, the 18 data-points represent 18 implementation
cases which can be grouped in three subsets corresponding to latency
values $L_a$ for $C_a$ equal respectively to 0, 1, 2 clock cycles. Each of these subsets
contains 6 data-points corresponding to latency values $L_b$ for $C_b$ going from
0 to 5 clock cycles. Finally, for each implementation case, we set the system
clock frequency as $f_{CLK} = \max\{L_a, L_b\} + 1$. The plot illustrates how different
PDLX implementations perform under the same data stimulus, showing that the
throughput values are consistent across different benchmarks. All implementations
are functionally equivalent by construction, being obtained simply by
changing the number of relay stations on the channels and with no need to
redesign any PDLX module. The insertion of relay stations can be made at late
stages in the design process, after detailed information can be extracted from the
physical layout, to “fix” those channels whose latency is longer than the desired
clock cycle.
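
The throughput arithmetic above can be written out in a few lines. The Python sketch below is illustrative only: the function names are assumptions, the clock frequency is treated as a normalized value, and no measured data from the experiments is reproduced.

```python
def effective_throughput(committed, stall_cycles, latency_cycles, f_clk):
    """ET = (I / T) * f_CLK, with T = I + S + P."""
    total_cycles = committed + stall_cycles + latency_cycles
    return (committed / total_cycles) * f_clk


def implementation_cases():
    """Enumerate the 18 (La, Lb, f_CLK) cases: La in 0..2, Lb in 0..5."""
    return [(la, lb, max(la, lb) + 1) for la in range(3) for lb in range(6)]
```

As a made-up example, a run with 100 committed instructions, 20 stall cycles and 5 latency cycles at a normalized frequency of 2 gives an effective throughput of (100/125) · 2 = 1.6.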

7. Conclusions and Future Work 

We proposed a new “correct-by-construction” synthesis methodology for designing
very large digital systems by assembling IP functional modules. The
modules communicate by exchanging data on communication channels according
to an appropriate protocol, which guarantees correct system behavior independently
of channel latencies. As a consequence, a robust implementation
is achieved in a shorter time by reducing the multiple iterations between log- 
ical and physical design. We developed a set of RTL libraries for a specific 
latency insensitive protocol and we used them to design a latency insensitive 
implementation of PDLX, an out-of-order microprocessor with speculative ex- 
ecution. There are several avenues for further investigations: (1) application to 
other designs, particularly in the multimedia domain, (2) study of the impact of 
our approach on other design metrics such as area and, especially, power, (3) 
extension of the theory to speculation insensitive protocols. 

1. Observe that most hardware systems can be easily made stallable: for instance, consider any sequential
logic block together with a gated clock mechanism, or a Moore finite state machine with an extra input that 
can force it to stay in the current state while emitting a “flag signal”. 

2. A bounded skew is allowed between the different branches of a net. 

3. But the communication bandwidth would be limited by the inverse of the longest of the round trip 
times between pairs of communicating relay stations. 

4. Recall that the primary reason for this double capacity is the need of avoiding losing data while 
spending one cycle to propagate the stop signal. 

Abstract

VLIW ASIPs provide an attractive solution for increasingly pervasive real-time multimedia and 
signal processing embedded applications. In this paper we propose an algorithm to support trade- 
off exploration during the early phases of the design/specialization of VLIW ASIPs with clustered 
datapaths. For purposes of an early exploration step, we define a parameterized family of clustered 
datapaths D{m,n), where m and n denote interconnect capacity and cluster capacity constraints 
on the family. Given a kernel, the proposed algorithm explores the space of feasible clustered 
datapaths and returns: a datapath configuration; a binding and scheduling for the operations; and 
a corresponding estimate for the best achievable latency over the specified family. Moreover, we 
show how the parameters m and n, as well as a target latency optionally specified by the designer, 
can be used to effectively explore trade-offs among delay, power/energy, and latency. Exten- 
sive empirical evidence is provided showing that the proposed approach is strikingly effective at 
attacking this complex optimization problem. 



1. Introduction 

Real time multimedia and signal processing embedded applications often 
spend most of their cycles executing a few time critical code segments (kernels) 
with well defined characteristics, making them amenable to processor specialization.
Moreover, these computation intensive kernels often exhibit a high degree
of inherent instruction level parallelism (ILP). Thus, Very Long Instruction
Word (VLIW) Application Specific Instruction Set Processors (ASIPs) provide
an attractive solution for such embedded applications.

Traditionally, the datapaths of VLIW machines have been based on a single 
register file shared by all functional units (FUs). The central register file pro- 
vides internal storage as well as switching, i.e., interconnection among the FUs 
and to/from the memory system. Unfortunately, this simple organization does 
not scale well with the large number of functional units typically required to take 
advantage of the ILP present in the embedded applications of interest. Indeed, 
it has been shown in [14] that, for $N$ FUs connected to a register file, the area
of the register file grows as $N^3$, the delay as $N^{3/2}$, and the power dissipation as $N^3$.



^Author is currently with AnunoCore Technology, Santa Clara, CA. 

In short, as the number of FUs increases, internal storage and communication
between FUs quickly become the dominant, if not prohibitive, factor in terms
of delay, power dissipation, and area.

A key observation is that the delay, power dissipation and area associated with 
the storage organization can be dramatically reduced by restricting the connec- 
tivity between FUs and registers, so that each FU can only read and write from/to 
a limited subset of registers[14]. Thus a key dimension of VLIW ASIP special- 
ization is clustering, i.e., the development of datapaths comprised of clusters of 
FUs connected to local storage (the cluster’s register file). Although by moving 
from a centralized to a distributed register file organization one can reap sig- 
nificant delay, power and area savings, this type of specialization may come at 
a cost. One may have to transfer data among these register files (i.e., datapath 
clusters), possibly resulting in increased latency. 

More concretely, consider a family of clustered datapaths wherein each clus- 
ter has no more than a given number of FUs, irrespective of type. We shall refer 
to this constraint as a cluster capacity constraint. Intuitively, as the cluster ca- 
pacity decreases (and thus the number of ports and size of the associated register 
file decrease), one expects combinational delay as well as power dissipation to 
decrease, while the number of clock cycles (latency) required to execute a given 
kernel to increase. In the limit, when comparing a clustered machine to a hy- 
pothetical centralized machine with the same number of FUs, one expects to be 
able to sustain higher clock rates in the clustered machine, but at the cost of 
increased latency, due to the need to move data among register files. 

Moreover, as cluster capacity decreases, one also expects power dissipation 
to decrease with respect to the centralized machine. Indeed, clustered machines 
would have local register files that have fewer ports and are smaller than the 
single register file of the centralized machine, thus achieving a less costly (lo- 
cal) switching inside each cluster. Unfortunately, switching may also be needed 
among clusters, i.e., there may be a need to perform move (or copy) operations 
across register files of different clusters, with a corresponding undesirable effect 
in energy consumption. 

Note that while performance and power/energy are a major concern in em- 
bedded applications, silicon area (per se) is not necessarily a concern, since with 
today’s levels of integration one can cost-effectively place large numbers of transistors
on a single chip [1, 13]. Thus, in exploring the design space with respect
to the impact on performance and power/energy of different cluster capacities, 
one can allow for an unbounded number of clusters - at least during the early 
phases of the exploration. In fact, as observed in [5], in signal processing ap- 
plications with high ILP, in order to achieve high throughput one should expect 
datapaths with a large number of functional resources and low resource sharing. 
Thus, placing an upper bound on the total number of functional resources was 
considered inadequate. One should, however, consider a constraint on the interconnect
capacity, since congestion (during data transfers across clusters) may
lead to major performance penalties, and the interconnect structure has a sig- 
nificant impact on the relevant figures of merit discussed above (i.e., delay and 
power/energy). 

In summary, when considering the specialization of a datapath to a given 
kernel, one should seek solutions with a (possibly large) number of clusters 
working (quasi-) independently. Note that such configurations are the “ideal” 
ones, in that they decrease power and delay, by taking advantage of locality in 
the computations, while incurring no (significant) latency/energy penalties due 
to switching across clusters. 

In this paper we propose an algorithm for estimating the minimum latency 
achievable by a family of clustered machines - the family is defined by the clus- 
ter and interconnect capacity constraints discussed above. In particular, given 
a kernel and capacity constraints, our algorithm explores the space of feasible 
clustered datapaths and returns: (1) an “optimal” datapath configuration; (2) a 
binding and scheduling for the operations; and (3) a corresponding estimate for 
the minimum achievable latency over the family of clustered datapaths. 

The algorithm proposed in this paper is a fundamental tool for the early ex- 
ploration required to design specialized clustered datapaths. To the best of our 
knowledge, this problem has not been addressed before, see §5. We formalize
the problem under consideration in §2. Our novel approach is based on:
(1) an effective decomposition of the problem into a sequence of simpler sub- 
problems; and (2) an aggressive heuristic pruning of the large design spaces 
defined by these sub-problems. This is discussed in §3. Extensive empirical 
evidence is provided in §4 showing that the approach is strikingly effective at 
attacking this exceedingly complex problem. Moreover, the discussions therein 
illustrate how the algorithm can be used within a general design space explo- 
ration framework. Conclusions are presented in §6. 

2. Problem Definition 

Our goal is to support early phases of the design of VLIW machines special- 
ized to execute time critical segments (kernels) of target embedded applications. 
The identification of these time critical segments, represented as basic blocks, 
superblocks, etc., is thus performed prior to this exploration step [10, 9]. These
kernels are represented as dataflow graphs (DFGs), i.e., in terms of a DAG,
$G(V, E)$, where the nodes $V$ represent operations to be carried out on datapath
functional resources, e.g., adds, multiplies, etc., also called activities, and the
edges $E \subseteq V \times V$ represent data objects that are “produced” and “consumed”
by activities during the flow of execution, see, e.g., Figure 3. As discussed in
the sequel, the DFG model of the application may be modified to include nodes
corresponding to move/copy operations (i.e., data transfers across clusters)
requiring the interconnect resources. The location of such moves only becomes
clear once a datapath and binding of functional operations to the datapath’s
resources begin to be specified.

The problem to be addressed is one of simultaneous allocation and binding, 
subject to coarse hierarchical “structural constraints.” We parameterize families 
of clustered datapaths D(m,n) as follows. Each datapath may contain several 
independent components, see, e.g., Figure 1. Each independent component, in
turn, contains a collection of clustered FUs, i.e., ALUs and multipliers that share 
a common register file. The clusters within each component share a local inter- 
connect structure with capacity not exceeding m. Each cluster has no more than 
n FUs, but no limit is placed on the number of components and associated clus- 
ters that can be instantiated in the datapath. 




Figure 1. A component of a clustered VLIW datapath. 

A feasible binding of a DFG to a clustered datapath specifies on which clus- 
ters activities will (and can) execute. Given a binding of a dataflow to a datapath 
one can schedule activities so as to minimize execution latency. The problem to 
be addressed can be stated as follows: 

Problem 1 The problem $P(m, n, DFG)$ is to find a datapath $D^* \in D(m, n)$ and a
binding and scheduling of the DFG to $D^*$ that results in a small, if not minimal,
execution latency. We let $T^*(m, n, DFG)$ denote the minimal execution latency
that can be achieved.

Note that our coarse parameterization of datapaths is aimed at reducing the 
size/complexity of the design space for an initial exploration conducted at a 
high level of abstraction. (This is not unlike approaches to similar high-complexity
CAD / compilation problems, which typically resort to abstraction and problem
phasing, see e.g., [5, 10, 9].) Specifically, we do not model limitations on register
file capacities for each cluster, and only consider data transfers of temporary
values across clusters. Nevertheless, a solution to our abstract problem enables
an effective exploration of a huge space of possible designs and thus datapath
specialization along two critical dimensions: cluster and interconnect capacities.
Promising datapath configurations are then considered in more detail, during
subsequent phases of the VLIW ASIP design/specialization process [11].

3. Algorithm to support VLIW datapath 
specialization 

The pseudo-code below exhibits the main high-level tasks of the proposed 
algorithm. The algorithm includes two main decompositions. The first, per- 
formed by the function Gen-IDFGs, corresponds to partitioning the DFG into a 
set of independent DFGs, or IDFGs, which can be addressed separately. Specif- 
ically, given a DFG we consider the associated undirected graph, and identify 
its connected components. Each connected component corresponds to an IDFG. 
Clearly, such IDFGs constitute ideal “chunks” of computation that can be per- 
formed on a single datapath component, requiring no local communication with 
other components. Thus for each IDFG the goal is to find an “optimal” clustered 
structure for the associated component. 
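
The first decomposition amounts to a connected-components computation on the undirected version of the DFG. A minimal Python sketch follows; the function name `gen_idfgs` and the representation of a DFG as a node list plus a list of (producer, consumer) pairs are assumptions made for illustration.

```python
from collections import defaultdict


def gen_idfgs(nodes, edges):
    """Split a DFG into IDFGs, i.e. connected components of its undirected graph."""
    neighbours = defaultdict(set)
    for producer, consumer in edges:
        neighbours[producer].add(consumer)
        neighbours[consumer].add(producer)   # forget the edge direction
    seen, idfgs = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:                          # depth-first traversal
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(neighbours[v] - component)
        seen |= component
        # Both endpoints of an edge always lie in the same component.
        sub_edges = [(u, v) for u, v in edges if u in component]
        idfgs.append((component, sub_edges))
    return idfgs
```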



Algorithm (m, n, DFG, TL) {
    TL = max[TL, ASAP(DFG)];                                  // initialization
    solution = (datapath, binding, schedule, latency) = (∅, ∅, ∅, TL);
    UpdateSolution(solution);
    Set-IDFGs = GenIDFGs(DFG);                                // generate a set of IDFGs
    for each IDFG ∈ Set-IDFGs {                               // decomposition 1
        s1 = oneClusterSolution(m, n, IDFG);                  // try 1 cluster solution
        if (latency(s1) ≤ TL) { UpdateSolution(s1); }
        else { s2 = multClusterSolution(m, n, IDFG);          // decomposition 2
            if (latency(s2) < latency(s1)) {                  // choose min latency solution
                UpdateSolution(s2); }
            else { UpdateSolution(s1); }
        }
    } return (solution); }
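
The control flow of the pseudo-code can also be expressed as a short Python sketch. Everything here is a simplification under stated assumptions: the two solvers are passed in as hypothetical callbacks that return a dictionary with a `latency` entry (the parameters m and n are assumed to be captured inside them), a single cluster solution that exactly meets the target latency is accepted, and the effect of UpdateSolution is reduced to raising the target latency when a chosen solution exceeds it.

```python
def explore(idfgs, target_latency, one_cluster_solution, mult_cluster_solution):
    """Top-level loop over IDFGs; solvers are caller-supplied callbacks."""
    chosen = []
    for idfg in idfgs:
        s1 = one_cluster_solution(idfg)
        if s1["latency"] <= target_latency:
            best = s1                          # single cluster meets the target
        else:
            s2 = mult_cluster_solution(idfg)
            # Prefer the single cluster solution unless s2 is strictly better.
            best = s2 if s2["latency"] < s1["latency"] else s1
        # Assumed UpdateSolution behaviour: the target can only grow.
        target_latency = max(target_latency, best["latency"])
        chosen.append(best)
    return {"components": chosen, "latency": target_latency}
```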



The second decomposition, which will be explained in more detail below, is 
used to synthesize multi-cluster datapath components suitable for each IDFG. 
The key idea is to decompose an IDFG into the operations which are the most 
difficult to handle. Each operation is given a difficulty ranking assessing the 
likelihood that latency penalty steps will be incurred due to limited cluster or 
interconnect capacity, i.e., resulting from serializing operations within a cluster
due to limited FU capacity, or from introducing data transfers across clusters. 
Given such a ranking, we extract a set of operations with highest rankings and 
consider the induced sub-IDFG. Our approach then determines datapath clus- 
ters, bindings, and a partial schedule which are suitable for the induced sub- 
IDFG in the sense of minimizing latency penalties. The next IDFG sub-problem 
is addressed in a similar fashion. 

In the sequel we focus on the key conceptual contributions of our approach, 
which are the decomposition of an IDFG into sub-problems, and a systematic 
method for synthesizing multi-cluster datapath solutions for such sub-problems. 



3.1 Difficulty ranking function and decomposition 

Our algorithm keeps track of, and updates, a global variable denoted target 
latency TL. The target latency is either specified by the designer or set to be the 
as-soon-as-possible (ASAP) latency bound for the DFG, denoted ASAP(DFG). 
Given TL, for each operation o in the DFG, we compute mobility(o) = ALAP(o)
- ASAP(o), see, e.g., Figure 3. The mobility corresponds to the operation’s difficulty
ranking. Clearly an operation with low mobility is likely to be difficult
to handle as it has few temporal degrees of freedom to deal with possible serial- 
ization or data transfers among clusters. 
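
As an illustration of the ranking, the Python sketch below computes mobilities for a small DAG. It assumes unit-latency operations and unlimited resources when computing the ASAP and ALAP steps, which is a simplification introduced here; the function name `mobilities` and the edge-list representation are likewise assumptions.

```python
from graphlib import TopologicalSorter


def mobilities(nodes, edges, target_latency=None):
    """Return mobility(o) = ALAP(o) - ASAP(o) for unit-latency operations."""
    preds = {v: set() for v in nodes}
    succs = {v: set() for v in nodes}
    for producer, consumer in edges:
        preds[consumer].add(producer)
        succs[producer].add(consumer)
    order = list(TopologicalSorter(preds).static_order())
    asap = {}
    for v in order:                       # earliest step: after all producers
        asap[v] = max((asap[u] + 1 for u in preds[v]), default=0)
    asap_latency = max(asap.values()) + 1
    tl = asap_latency if target_latency is None else max(target_latency, asap_latency)
    alap = {}
    for v in reversed(order):             # latest step: before all consumers
        alap[v] = min((alap[w] - 1 for w in succs[v]), default=tl - 1)
    return {v: alap[v] - asap[v] for v in nodes}
```

For the chain a, b, c with an extra producer d feeding c, the operations on the critical path get mobility 0 while d gets mobility 1.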

We will be progressively constructing a global solution, i.e., a datapath, a 
binding, a schedule, and a feasible latency for the complete problem, based on 
considering several sub-problems. Each time a sub-problem is solved, the function
UpdateSolution() is invoked to update the current global solution. This
involves several tasks. First, if the sub-problem solution exceeds the current target
latency, the global solution is updated accordingly. Then, the sub-problem op- 
erations and data transfers are anchored on the corresponding scheduling steps. 
Finally, the mobility of operations not yet anchored is recomputed. 

The primary consideration driving the algorithm is to minimize execution la- 
tency. A secondary consideration is to minimize the number of data transfers 
among clusters. Thus for each IDFG (or IDFG sub-problem) the primary goal 
is to find a solution either within the current target latency, or resulting in a 
minimal increase in target latency. In each case, a single cluster solution, generated
by oneClusterSolution(), and a multiple cluster solution, generated by
multClusterSolution(), may be obtained. The function latency() returns the
execution latency of a solution to a sub-problem. To satisfy the secondary goal, 
preference is always given to single cluster solutions that can achieve the current 
target latency, or provide the same or better latency than multi-cluster solutions. 
This is done in an attempt to reduce the number of clusters and data transfers in 
the final solution - see comments in §4. 

Figure 2. Simplified problem decomposition strategy.



When a multi-cluster solution is sought for a given IDFG, our second de- 
composition takes place, see Figure 2. As mentioned previously, multi-cluster 
solutions are obtained by first tackling a sub-problem associated with the most 
difficult set of operations in an IDFG, denoted sub-problem 1. The function
ExtractSubproblem1(), called within multClusterSolution(), extracts an induced
sub-IDFG associated with the operations with minimum mobility, denoted MM,
and those with mobility MM + 1, if they have a direct producer or consumer
with mobility MM. The rationale for including the second type of operations 
is that, if they were bound to a different cluster, they would incur additional 
data transfers reducing their mobility by at least one, and thus making them as 
difficult to handle as operations with mobility MM. For our benchmarks, this 
heuristic always selected more than 60% of the IDFG’s operations. For the
example in Figure 3 the induced sub-IDFG associated with extracting the first
sub-problem is shown on the right. The induced sub-IDFG includes the subset
of nodes satisfying the criterion, and edges among those nodes. In the simplest
version of our algorithm, only one additional sub-problem is considered,
namely ExtractSubproblem2(), associated with the operations not considered 
in the first sub-problem, see Figure 2. 
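
The selection criterion for the first sub-problem can be sketched as follows. This Python fragment is an illustration only: the function name `extract_subproblem1` and the dictionary-based representation of mobilities are assumptions, and the mobilities are taken as given.

```python
def extract_subproblem1(mobility, edges):
    """Select operations with mobility MM, plus those with MM + 1 that have a
    direct producer or consumer with mobility MM; return the induced sub-IDFG."""
    mm = min(mobility.values())
    neighbours = {}
    for producer, consumer in edges:
        neighbours.setdefault(producer, set()).add(consumer)
        neighbours.setdefault(consumer, set()).add(producer)
    selected = {op for op, mob in mobility.items() if mob == mm}
    selected |= {
        op for op, mob in mobility.items()
        if mob == mm + 1
        and any(mobility[x] == mm for x in neighbours.get(op, ()))
    }
    # Keep only the edges whose two endpoints were both selected.
    induced = [(u, v) for u, v in edges if u in selected and v in selected]
    return selected, induced
```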

The key property underlying this sub-problem decomposition for IDFGs is 
as follows. The first problem is associated with the most difficult operations, 
but includes only a relaxed set of sequencing constraints. This makes it usually 
easier to solve, i.e., to find a suitable clustered datapath component, binding and 
schedule resulting in a reduced latency. If the first sub-problem cannot be solved 
within the current target latency, then invariably the original IDFG would not be 
feasible within that target latency. Formalizing and proving this fact is simple 
and gives the following: 

Fact 3.1 Suppose a dataflow graph subDFG is induced by a subset of operations
in a DFG. Then $T^*(m, n, subDFG) \le T^*(m, n, DFG)$.

Thus, if the first sub-problem incurs a latency penalty, i.e., forces an increase in 
the current target latency, then that penalty will persist and typically make the 
second sub-problem easier, if not trivial, to solve. 

Figure 3. Ranking operations and extracting induced sub problems.



Multi-cluster datapath components for an IDFG are synthesized by decom- 
posing the IDFG into several single cluster sub-problems, see Figure 2. These 
may correspond to the two IDFG sub-problems discussed above, or further de- 
compositions of these to smaller sub-problems, as discussed in the next section. 
In progressively synthesizing a multi-cluster solution, several single cluster so- 
lutions to parts of the IDFG are composed. In an attempt to reduce the number 
of clusters in the datapath components, prior to instantiating a new cluster and 
assigning it to an IDFG sub-problem, we first attempt to bind and schedule the 
associated operations on existing clusters. If the capacity available in such clus- 
ters is insufficient, i.e., if we fail to meet the current target latency, a new cluster 
is allocated, and its FUs are selected to lead to the smallest latency penalty for
the sub-problem under consideration. 

In this process the execution latency is evaluated by scheduling operations 
and moves using a simple modification of the list scheduling algorithm [3] which
is described in the sequel. Moves, i.e., data transfers requiring the use of the 
shared interconnect resource, are inserted when operations sharing an edge are 
bound to different clusters. Thus, when a multi-cluster component is being 
synthesized, the accrued load on the shared interconnect resource impacts the 
scheduling of move operations associated with each of the IDFG’s sub-problems. 
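
The move insertion rule described above can be sketched in Python. The function name `insert_moves` and the naming scheme for move nodes are assumptions made for illustration, and the sketch inserts one move per crossing edge; whether transfers of the same value to several consumers in one cluster are shared is not stated in the text and is not modeled.

```python
def insert_moves(edges, binding):
    """Insert a move node on every edge whose endpoints are bound to
    different clusters; return the new edge list and the list of moves."""
    new_edges, moves = [], []
    for producer, consumer in edges:
        if binding[producer] == binding[consumer]:
            new_edges.append((producer, consumer))   # same cluster: no transfer
        else:
            move = f"mv_{producer}_{consumer}"       # hypothetical node name
            moves.append(move)
            new_edges += [(producer, move), (move, consumer)]
    return new_edges, moves
```

The resulting graph, with its extra move nodes competing for the shared interconnect, is what a list scheduler would then process to evaluate the execution latency.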

3.2 Finding multi-cluster solutions for IDFG
sub-problems 

Let us first consider a simple example where each cluster can have at most 
one FU, i.e., n = 1. In this case, ideally, one identifies long independent strings 
of operations that require the same type of FU within the IDFG sub-problem,
and binds such strings to independent clusters of the right type. By binding
operations along such a string to the same cluster, one avoids requiring data transfers
for the edges (associated data objects) along the string. By ensuring the string 
includes operations of the same type, one avoids a mismatch between the load 
placed on the cluster and the capacity of the cluster. This suggests a general 
heuristic to determine datapath clusters and operation bindings for IDFG sub- 
problems that do not consist of independent strings and with nontrivial (n > 1) 
cluster capacities. The idea is to find sets of operations corresponding to sub- 
trees, rather than strings. We call these sets vertical aggregations and recog- 
nize that binding such aggregations to clusters with appropriate capacities might 
translate to reduced penalties due to data transfers. At the same time, it is of in- 
terest to identify sets of consecutive operations that have compatible resource 
requirements. We call these horizontal aggregations and recognize that bind- 
ing these to compatible clusters might also avoid excessive serialization within 
limited capacity clusters. 

Thus, our algorithm first determines vertical and horizontal aggregations of 
operations for IDFG sub-problems. Based on these, it creates possible partitions 
of its operations. By binding each element in the partition, i.e., set of operations, 
to a compatible cluster, we synthesize candidate clustered datapaths and bind- 
ings. The operations are then scheduled to evaluate the solution. We discuss 
these steps in more detail below. 

Vertical Aggregation. Given an extracted sub-IDFG, vertical aggregation
creates a collection of subsets of operations, $\mathcal{V}$, corresponding to
sub-trees in the sub-IDFG, see e.g., Figure 4. Since vertical aggregation is
attempted for a sub-IDFG when a single cluster solution appears to be
inadequate/inferior (see Figure 2), one can assume at least two clusters are
required. Thus, in attempting to partition the sub-IDFG into vertical
aggregates, we ensure that there always exist at least two ongoing subtrees,
i.e., cluster parallelism to be exploited. As explained below, this requirement
translates to avoiding full merging, i.e., avoiding aggregating all the
operations within a single tree in any given “layer” of the sub-IDFG.

Vertical aggregates are generated in both a top-down and bottom-up fashion,
considering one layer at a time; a layer corresponds to the set of operations
falling on a given step in the ASAP schedule for the sub-IDFG. For the top-down
case, we begin by positing that each activity in the top layer (initialized to
the first step of its ASAP schedule) corresponds to an independent tree. At
each step one considers growing and/or merging trees on the previous layer
following the edges between operations in the two layers, see e.g., Figure 4.
Such growing/merging takes place only if the resulting aggregates (1)
correspond to subtrees and (2) have not resulted in a single aggregate, i.e., a
full merging of all the operations between the current layer and the top layer.
When such growing/merging of trees violates one of these conditions (see e.g.,
the transitions from Layer 4 to 5 in Figure 4), the current subsets, e.g.,
$V_1$ and $V_2$, are added to the collection of vertical top-down aggregations
$\mathcal{V}_t$, and the process restarts with the current layer as the new top
layer. Thus, for our example, Layer 5 becomes the new top layer and the
activities in the layer are assumed to correspond to independent trees. In our
example the transition from Layer 5 to 6 again leads to a full merging, and
thus the sets $V_3$ and $V_4$ with single operations are added to
$\mathcal{V}_t$, and the process restarts on Layer 6, resulting in two
additional vertical aggregation sets $V_5$ and $V_6$. In the sequel we will
refer to aggregations that contain only one operation, as is the case for $V_3$
and $V_4$, as trivial and will attempt to merge these with larger vertical or
horizontal aggregates.
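The top-down pass described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the function name `top_down_aggregates`, the `layers` list and the `preds` map are hypothetical, and only the full-merging condition (2) is checked; the subtree condition (1) is assumed to hold for the input graph.

```python
def top_down_aggregates(layers, preds):
    """Sketch of top-down vertical aggregation (full-merging check only).

    layers: list of lists of operations, one list per ASAP step (top first).
    preds:  dict mapping an operation to an iterable of its parent operations.
    Returns the list of aggregates (sets of operations) in emission order.
    """
    emitted = []
    groups = [{op} for op in layers[0]]  # each top-layer op starts its own tree
    for layer in layers[1:]:
        candidate = [set(g) for g in groups]
        for op in layer:
            parents = set(preds.get(op, ()))
            # Trees that contain a parent of op are grown/merged through op.
            touched = [g for g in candidate if g & parents]
            merged = {op}.union(*touched)
            candidate = [g for g in candidate if not (g & parents)] + [merged]
        if len(candidate) == 1:
            # Full merging: freeze the current aggregates and restart with
            # this layer as the new top layer.
            emitted.extend(groups)
            groups = [{op} for op in layer]
        else:
            groups = candidate
    return emitted + groups


# Hypothetical three-layer sub-IDFG: two trees that would fully merge at "g".
layers = [["a", "b", "c", "d"], ["e", "f"], ["g"]]
preds = {"e": ["a", "b"], "f": ["c", "d"], "g": ["e", "f"]}
aggregates = top_down_aggregates(layers, preds)
```

In this toy input the two trees rooted at "e" and "f" are emitted when "g" would merge them, and "g" is left as a trivial aggregate, mirroring the restart behaviour described in the text.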



Figure 4. Vertical aggregation. (Panels: sub-IDFG for sub-problem 1; vertical aggregates.)

Bottom-up vertical aggregations, denoted by $\mathcal{V}_b$, are generated in
the same fashion but starting from the bottom. Although for our example they
result in the same aggregations, in general this is not the case. Since the
sub-IDFG is a DAG, identification of the vertical aggregates is
straightforward, having only linear complexity in the number of nodes. Finally,
we note that a more refined model for vertical aggregation could be obtained by
hierarchically keeping track of all merge points seen in this process. So far,
we have found that in practice these fine grained partitions are not useful,
i.e., even when data transfer delays are small (1 cycle), scattering small
aggregates across many clusters was never advantageous from a latency point of
view, see §4.

Horizontal Aggregation. The horizontal aggregation step creates a collection
$\mathcal{H}$ of aggregates corresponding to sub-IDFG operations on consecutive
layers with compatible loads. To do so, we determine the load profile for the
operations on each layer, e.g., the multiplier load $\mathit{multload}(i)$ on
layer $i$ is the sum of $(\mathit{mobility}(o) + 1)^{-1}$ over all operations
$o$ on layer $i$ requiring a multiplier. We compute the ALU load,
$\mathit{ALUload}(i)$, in a similar fashion.³ Intuitively the resource load on
two layers is compatible if there exists a cluster type that is a good match
for both. Recall that a feasible cluster type is defined by the number of ALUs
and multipliers it includes, so long as their sum does not exceed our
constraint $n$. We define a notion of load compatibility for a constraint $n$
as follows. For each layer $i$ we determine the set of all feasible cluster
types, $CT(i)$, that would be able to support the layer’s resource load in a
minimum number of steps (non-integral loads are rounded up). Two (or more)
consecutive layers, say $i$ and $i+1$, are said to be compatible if
$CT(i) \cap CT(i+1) \neq \emptyset$, i.e., if there exists a cluster type that
would be able to support their individual loads in a minimal number of steps.
For illustration purposes, consider a hypothetical example including the
following consecutive layers: Layer 1, with $\mathit{ALUload}(1) = 2$ and
$\mathit{multload}(1) = 2$; Layer 2, with $\mathit{ALUload}(2) = 2$ and
$\mathit{multload}(2) = 0$; and Layer 3, with $\mathit{ALUload}(3) = 3$ and
$\mathit{multload}(3) = 0$. Assuming a cluster capacity n = 3, the
corresponding feasible cluster types would be
$CT(1) = \{\text{2A1M}, \text{1A2M}\}$, $CT(2) = \{\text{2A1M}, \text{3A}\}$,
and $CT(3) = \{\text{3A}\}$. For this example, Layers 1 and 2 would be
compatible since $CT(1) \cap CT(2) = \{\text{2A1M}\} \neq \emptyset$.
Similarly, Layers 2 and 3 would be compatible since
$CT(2) \cap CT(3) = \{\text{3A}\} \neq \emptyset$.
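The load and cluster-type definitions above can be checked with a small Python sketch. The function names are hypothetical, and one assumption goes beyond the text: only cluster types that use all n FU slots are enumerated, which is what reproduces the three sets of the hypothetical example.

```python
from fractions import Fraction
from math import ceil


def layer_load(mobilities):
    """Load of one FU type on a layer: sum of 1 / (mobility(o) + 1)."""
    return sum(Fraction(1, m + 1) for m in mobilities)


def steps_needed(load, units):
    """Steps to serve a load (rounded up) with `units` FUs; None if impossible."""
    load = ceil(load)
    if load == 0:
        return 0
    if units == 0:
        return None
    return ceil(Fraction(load, units))


def feasible_cluster_types(alu_load, mult_load, n):
    """Return CT(i) as a set of (num_alus, num_mults) pairs.

    Assumption: only cluster types with num_alus + num_mults == n are listed.
    """
    best, result = None, set()
    for alus in range(n + 1):
        mults = n - alus
        a = steps_needed(alu_load, alus)
        m = steps_needed(mult_load, mults)
        if a is None or m is None:
            continue  # this type cannot execute the layer at all
        steps = max(a, m)
        if best is None or steps < best:
            best, result = steps, {(alus, mults)}
        elif steps == best:
            result.add((alus, mults))
    return result


# Hypothetical example from the text, n = 3; (2, 1) stands for 2A1M, etc.
ct1 = feasible_cluster_types(2, 2, 3)
ct2 = feasible_cluster_types(2, 0, 3)
ct3 = feasible_cluster_types(3, 0, 3)
```

Here `ct1 & ct2` and `ct2 & ct3` are both non-empty, so the two pairs of layers are compatible in the sense defined above.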

Figure 5 exhibits $CT(i)$ when n = 2 for our example. Layers i = 5, 6, 7 have
$CT(i) = \{\text{2A}\}$ (i.e., clusters with 2 ALUs) and are thus jointly
compatible, so a single horizontal aggregation of operations, the set $H_1$, is
placed in the collection $\mathcal{H}$. In general, $\mathcal{H}$ includes the
largest sets of compatible horizontal aggregations. Since our notion of load
compatibility among layers is not transitive, in some cases such aggregates may
overlap. For example, in the hypothetical case introduced above, the horizontal
aggregation formed by Layers 1 and 2 is not compatible with that formed by
Layers 2 and 3, since $\{\text{2A1M}\} \cap \{\text{3A}\} = \emptyset$. Thus,
two distinct (overlapping) horizontal aggregations would have been formed.
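Because compatibility is not transitive, the largest compatible runs of layers can overlap. A minimal Python sketch of one way to enumerate them follows; the function name and the string encoding of cluster types are hypothetical, and single-layer runs are also reported here even though the text only names aggregations of two or more layers.

```python
def horizontal_runs(ct_per_layer):
    """Return maximal runs (first, last) of consecutive layers whose CT sets
    share at least one cluster type. Indices are 0-based."""
    runs = []
    for start in range(len(ct_per_layer)):
        common = set(ct_per_layer[start])
        end = start
        # Extend the run while a common cluster type survives.
        while end + 1 < len(ct_per_layer) and common & ct_per_layer[end + 1]:
            common &= ct_per_layer[end + 1]
            end += 1
        # Run ends never decrease as the start advances, so a run is new
        # only if it reaches further than the last one recorded.
        if not runs or end > runs[-1][1]:
            runs.append((start, end))
    return runs


# Hypothetical example from the text (n = 3): Layers 1-2 and Layers 2-3.
runs = horizontal_runs([{"2A1M", "1A2M"}, {"2A1M", "3A"}, {"3A"}])
```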

Optimization Algorithm. The third step in determining a multi-cluster
solution is to jointly search for a “good” overall partition of sub-IDFG
operations and corresponding binding of these partitions to suitable cluster
types. Although this is a very complex task when a flat design space is
considered, in our approach the search space is dramatically reduced by the
horizontal and vertical aggregates determined in the previous two steps.

Figure 5. Horizontal aggregation.

Specifically, our optimization heuristic proceeds as follows.

Based on the vertical aggregations ($\mathcal{V}_t$ and $\mathcal{V}_b$) and
the horizontal aggregations $\mathcal{H}$, we create coverings of the
sub-IDFG’s operations.⁴ In fact, we systematically generate several such
coverings, as follows: (1) one based on $\mathcal{V}_t$; (2) one based on
$\mathcal{V}_b$; and (3) several based on one or more sets in $\mathcal{H}$,
covering each of the uncovered horizontal slices entirely with elements of
either $\mathcal{V}_t$ or $\mathcal{V}_b$. In most cases we have but a few
horizontal aggregations, leading to a limited number of possible covers. In
this process we ensure that no set in a cover is fully contained within
another, but cannot ensure that the obtained covers are in fact partitions.

In the next phase of our algorithm we exhaustively derive partitions from
each of the obtained covers, i.e., any operation that is included in two or
more sets in a cover is removed and assigned to only one of them. Further
boundary perturbations creating additional partitions can be useful, e.g.,
merging trivial vertical aggregation sets with neighboring aggregates or
shifting operations across aggregates in layers where the interconnect capacity
has been exhausted. Because these perturbations involve only operations on the
boundaries between large aggregates, the demands in generating these are not
excessive. This process eventually transforms each cover into one of many
possible partitions, see e.g., Figure 6. Although this process can grow
exponentially in complexity, in practice the number of covers/partitions that
were generated in all our case studies did not justify any further pruning.
Specifically, given a partition, an exhaustive generation of boundary
perturbations and scheduling of the corresponding alternatives took no more
than a few seconds on a Sun UltraSparc I for the benchmarks shown in Table 1.
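The step that turns a cover into partitions can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, operations are represented by sortable labels, and the boundary perturbations mentioned above are not modelled; only the exhaustive reassignment of multiply-covered operations is shown.

```python
from itertools import product


def partitions_from_cover(cover):
    """Enumerate the partitions obtained by assigning every operation that
    appears in several sets of the cover to exactly one of those sets."""
    cover = [frozenset(s) for s in cover]
    ops = sorted(set().union(*cover))
    # For every operation, the indices of the cover sets that contain it.
    owners = [[k for k, s in enumerate(cover) if op in s] for op in ops]
    seen, result = set(), []
    for choice in product(*owners):
        blocks = [set() for _ in cover]
        for op, k in zip(ops, choice):
            blocks[k].add(op)
        partition = frozenset(frozenset(b) for b in blocks if b)
        if partition not in seen:  # different choices may coincide
            seen.add(partition)
            result.append(partition)
    return result


# Hypothetical cover in which operation "c" is covered twice.
parts = partitions_from_cover([{"a", "b", "c"}, {"c", "d"}])
```

The number of candidates is the product of the multiplicities of the shared operations, which is the source of the exponential growth noted above; it stays small when overlaps are confined to aggregate boundaries.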

Alternative multi-cluster datapaths are then generated based on these parti- 
tions. Each element in a partition corresponds to a single cluster problem, thus 
it makes sense to consider partitions with the fewest number of sets first. As dis- 
cussed in §3.1 and shown in Figure 2, single cluster solutions to each partition 
are composed to generate multi-cluster solutions to the (sub)IDFG, which are 
compared based on the execution latency they achieve. 

For our ongoing example, Figure 6 shows the best partition. The suitable 
cluster types when n = 2 were determined to be 1A1M (i.e., 1 ALU and 1 
multiplier) and 2A, as shown in the figure. The minimum latency schedule (for 
interconnect capacity m = 2) was determined to be 10 steps, which is optimal 
for the given capacity constraints. At this point the solution for sub-problem 2 
would be generated. 



Figure 6. Generating covers, partitions, and deriving cluster types. (Panels: possible partition; binding and required moves.)



Modified List Scheduling. Execution latencies are determined using the
following modified list scheduling algorithms. Given an IDFG, its first
sub-problem is scheduled (for the derived binding) using a standard list
scheduling priority function (longest path to any sink operation), enhanced by
a tie-breaking policy. Specifically, in the case of a tie, operations that are
ancestors of move operations are given higher priority. When an operation
cannot be scheduled within its time frame⁵, TL is incremented, the time frames
are updated, and the algorithm is repeated.

Recall that, when a solution for an IDFG’s first sub-problem is found, the
operations of the sub-IDFG, including moves, are anchored to their associated
scheduling steps, and the time frames and mobility of the remaining operations
are recomputed. Accordingly, scheduling for an IDFG’s second sub-problem is
performed as follows. A modified list scheduling algorithm traverses scheduling
steps from 1 to TL and, at each step, schedules as many ready operations⁶ as
available resources permit, using mobility as the priority function. As in the
previous case, if a tie occurs, priority is given to nodes which are ancestors
of unscheduled moves. When an operation cannot be scheduled within its time
frame, the scheduling process stops, the anchoring of all operations is
released, and an overall scheduling is performed for the same binding function
and the same initial target latency, using the list scheduling algorithm
described in the previous paragraph.
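The priority function and tie-breaking rule of the first algorithm can be sketched as below. This is a simplified illustration rather than the authors' scheduler: all names are hypothetical, every operation (including moves) takes one step, each operation is pre-bound to a named resource whose capacity is at least one, and the time-frame check with the increment of TL is omitted, so the function simply returns the schedule it reaches.

```python
def list_schedule(succs, resource, capacity, is_move=lambda op: False):
    """Unit-latency list scheduling; returns {op: step} with steps from 1.

    succs:    dict op -> list of successor ops (the DAG edges).
    resource: dict op -> name of the resource the op is bound to.
    capacity: dict resource name -> number of units available per step (>= 1).
    is_move:  predicate identifying move (data transfer) operations.
    """
    ops = list(resource)
    preds = {op: set() for op in ops}
    for u, vs in succs.items():
        for v in vs:
            preds[v].add(u)

    depth = {}

    def longest(op):
        # Main priority: longest path (counted in operations) to any sink.
        if op not in depth:
            depth[op] = 1 + max((longest(v) for v in succs.get(op, ())), default=0)
        return depth[op]

    feeds = {}

    def feeds_move(op):
        # Tie-breaker: is op an ancestor of some move operation?
        if op not in feeds:
            feeds[op] = any(is_move(v) or feeds_move(v) for v in succs.get(op, ()))
        return feeds[op]

    schedule, step = {}, 0
    while len(schedule) < len(ops):
        step += 1
        ready = [op for op in ops if op not in schedule
                 and all(schedule.get(p, step) < step for p in preds[op])]
        ready.sort(key=lambda op: (-longest(op), not feeds_move(op)))
        free = dict(capacity)
        for op in ready:
            if free[resource[op]] > 0:
                free[resource[op]] -= 1
                schedule[op] = step
    return schedule


# Hypothetical example: one ALU and a shared bus with one move slot per step.
resource = {"y": "alu", "x": "alu", "z": "alu", "mv": "bus"}
succs = {"x": ["mv"], "y": ["z"]}
sched = list_schedule(succs, resource, {"alu": 1, "bus": 1},
                      is_move=lambda op: resource[op] == "bus")
```

In the example, "x" and "y" tie on path length, and "x" is scheduled first because it feeds the move "mv".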

4. Experimental Results 

Table 1 shows the results produced by our algorithm for a number of repre- 
sentative benchmark kernels. For simplicity, operations and data transfers were 
assumed to take 1 cycle in all test cases, but our approach is general. The three 
Discrete Cosine Transform (DCT) algorithms (Lee, DIT and DIF [8]) typify 
complex kernels with high potential for ILP. The three filters (Elliptic, 
Autoregression and Avenhaus [6]) typify complex kernels with less potential for 
ILP. Information on the number of connected components (IDFGs), their critical 
path (i.e., absolute minimum latency), and number of operations for each 
IDFG is provided in Column 1. For each benchmark, we considered datapaths 
with interconnect capacity m = 2 and cluster capacity constraints of n = 2,3,4. 
For each of the 18 problem instances considered, the derived clustered datapath 
and the associated achievable latency L are shown in the table. Also shown is 
the total number of data transfers, abbreviated DTs, with subtotals per datapath 
component. 

We start by noting that our algorithm consistently found minimum latency 
penalty solutions for the specified capacity constraints, strongly suggesting that 
our aggressive design space pruning is effective.⁷ Moreover, for all cases but 
one, the achievable latency was driven by sub-problem 1 of an IDFG, confirming 
the effectiveness of our decomposition heuristic based on difficulty rankings. 
The exception occurred for the DIT DCT benchmark, with constraints (m,n) = 
(2,4). In this case, contention for interconnect resources caused an additional 
latency penalty of 1 step for the second sub-problem associated with the IDFG. 
(Note that this benchmark has a single IDFG with 48 nodes and critical path of 7 
steps, thus contention on the local interconnect for the datapath component was 
likely to occur, if a high degree of ILP was to be achieved.) Note however that 
this did not occur for cluster capacities n = 2, 3. In these two cases an
increased latency penalty resulted from sub-problem 1, which was sufficient to
allow a solution for the rest of the IDFG without further latency
increases/penalties.

| Benchmark | (m,n) | L | Datapath | #DTs |
|---|---|---|---|---|
| DCT-Lee: 49 ops (29 add/subs, 20 mults), 2 IDFGs, CP=9; IDFG1: 28 ops, CP=9; IDFG2: 21 ops, CP=7 | (2,4) | 10 | IDFG1: 2(2A2M)=2; IDFG2: (2A2M)=1 | 5 (5+0) |
| | (2,3) | 12 | IDFG1: (2A1M)+(1A1M)=2; IDFG2: (2A1M)=1 | 5 (5+0) |
| | (2,2) | 12 | IDFG1: 3(1A1M)=3; IDFG2: 2(1A1M)=2 | 11 (7+4) |
| DCT-DIF: 41 ops (29 add/subs, 12 mults), 2 IDFGs, CP=7; IDFG1: 24 ops, CP=7; IDFG2: 17 ops, CP=5 | (2,4) | 9 | IDFG1: 2(2A2M)=2; IDFG2: (2A1M)=1 | 2 (2+0) |
| | (2,3) | 10 | IDFG1: 2(2A1M)=2; IDFG2: (2A1M)=1 | 2 (2+0) |
| | (2,2) | 13 | IDFG1: 2(1A1M)=2; IDFG2: 1(1A1M)=1 | 2 (2+0) |
| DCT-DIT: 48 ops (36 add/subs, 12 mults), 1 IDFG, CP=7; IDFG1: 48 ops, CP=7 | (2,4) | 9 | IDFG1: (4A)+(3A1M)+2(2A2M)=4 | 9 |
| | (2,3) | 10 | IDFG1: 2(3A)+2(2A1M)+(1A1M)=5 | 11 |
| | (2,2) | 11 | IDFG1: 3(2A)+4(1A1M)=7 | 16 |
| 5th order WDE Filter: 34 ops (26 add/subs, 8 mults), 1 IDFG, CP=14; IDFG1: 34 ops, CP=14 | (2,4) | 14 | IDFG1: (2A2M)+(2A1M)=2 | 3 |
| | (2,3) | 15 | IDFG1: (2A1M)+(1A1M)=2 | 2 |
| | (2,2) | 15 | IDFG1: 3(1A1M)=3 | 7 |
| Auto Regression Filter: 28 ops (12 add/subs, 16 mults), 1 IDFG, CP=8; IDFG1: 28 ops, CP=8 | (2,4) | 10 | IDFG1: 2(1A2M)=2 | 4 |
| | (2,3) | 10 | IDFG1: 2(1A2M)=2 | 4 |
| | (2,2) | 11 | IDFG1: 2(1A1M)=2 | 4 |
| Avenhaus Filter: 18 ops (8 add/subs, 10 mults), 1 IDFG, CP=7; IDFG1: 18 ops, CP=7 | (2,4) | 8 | IDFG1: (1A2M)+(1A1M)=2 | 3 |
| | (2,3) | 8 | IDFG1: (1A2M)+(1A1M)=2 | 3 |
| | (2,2) | 9 | IDFG1: 2(1A1M)=2 | 2 |

Table 1. Experimental results.

The results in Table 1 show that solutions derived for larger capacity cluster
constraints may have the same latency as those for smaller capacity
constraints, see e.g., DCT Lee and WDE filter for n = 2, 3. Note however that
the solutions associated with the larger capacity clusters have fewer clusters
(with more FUs), and typically have fewer data transfers. This clearly shows
the bias of our algorithm towards serialization: solutions that add extra steps
by serializing operations inside a cluster are favored with respect to
solutions scattering these operations through various (possibly smaller)
clusters, and yet paying the same latency penalty due to data transfer delays.
More concretely, the proposed algorithm is biased towards solutions that use
fewer clusters of higher capacity, as opposed to using more clusters of
capacity smaller than that specified in the constraint. The underlying
rationale is that, when latency is identical, the former solutions will
typically lead to fewer data transfers. Finally, note that for the
Autoregression and Avenhaus filters, the solutions with n = 3, 4 are identical.
This indicates that the extra cluster capacity does not help with these
particular kernels.

Other interesting observations could be drawn from our case studies. Con- 
sider for example the two alternative DCT algorithms (DIT and DIF) shown in 
the table. Although the DCT-DIT algorithm has roughly 20% more operations 
than the DCT-DIF algorithm, it executes faster on a family with cluster capacity 
2, and has identical latency for larger capacities. Such non-trivial observations 
can be useful when performing algorithmic exploration in the context of a given 
embedded application. 

We conclude by briefly discussing how the proposed algorithm can be used 
to support trade-off exploration. Consider again the solution for the DIT DCT 
with a cluster capacity of 4. The “minimum” latency solution generated by the 
algorithm has 4 clusters and 9 steps, i.e., 2 steps in excess of the critical path. 
If the designer finds this number of clusters to be excessive, s/he can increase 
the target latency and execute the algorithm once more for the same capacity 
constraints. As the designer increases the initial target latency, the number of 
clusters in the solution will eventually reduce. Thus, the designer can explore 
trade-offs between latency and cluster area.⁸ Delay/power vs. latency trade-offs 
can be explored by considering different capacity constraints. Indeed, as the 
constraint on cluster capacity increases, fewer steps are typically required to 
execute the same kernel, yet the register file local to each cluster will have more 
ports and be larger, and thus delay and power consumption will increase⁹ [14]. 
Our proposed algorithm can thus play a key role in a design space exploration 
environment/framework. 

5. Related Work 

A significant body of work is available in the area of datapath synthesis for 
digital signal processing applications, see e.g., [3, 5]. Our focus is on approaches 
geared towards high throughput applications, e.g., [5, 2, 4, 15]. The use of 
hierarchy in the DFG and in the datapath is an important common characteristic 
of such approaches. Below we briefly contrast our work with the Cathedral 
compilers developed at IMEC [5], a representative example. 

Cathedral uses an Application Specific Unit (ASU) based architectural style 
[5]. ASUs are datapaths whose composition in terms of functional building 
blocks (i.e., FUs) and interconnection structure is customized to parts of the ap- 
plication flow graph, i.e., to judiciously selected clusters of operations. Below 
we argue that the design space defined by the ASU-based architectural style is 
not compatible with the problem handled in this paper. Specifically, our VLIW 
datapath clusters are fundamentally different from ASUs, and thus so are the 
objectives driving the aggregation of operations to be executed on these hier- 
archical datapath components. In the ASU-based design style, the emphasis is 
on performance optimization via FU chaining [5]. Switching of data among 
the FUs internal to an ASU is done only through interconnect and MUXES. 
Thus, no resource sharing is allowed within an ASU. By contrast, in our clus- 
tered VLIW datapaths, a cluster’s local register file is used to switch data among 
the cluster’s FUs. Moreover, resource sharing within a cluster (while executing 
an aggregate) is not only possible, but highly desirable. Similar contrasts can be 
made to other high-level synthesis approaches, showing that their adopted struc- 
tural hierarchy defines a design space incompatible with the problem addressed 
in this paper. 

Retargetable code generation has received significant attention lately, see e.g., 
[9, 10]. As in the previous case, the algorithms developed for code generation 
solve optimization problems different from ours. During code generation, op- 
eration assignment to clusters (i.e., binding) and other code generation tasks 
are performed assuming a specific target datapath. In contrast, in our approach 
the binding of operations (aggregates) to clusters is performed for an “optimal” 
datapath that is being simultaneously generated. Deriving optimal code for a 
specified/target VLIW clustered datapath is a different problem from that of 
efficiently finding a VLIW clustered datapath that can deliver “maximum” per- 
formance, under the specified capacity constraints. 

Several lower bounds on latency have been proposed, see e.g. [12, 16, 7]. 
Such work usually assumes a pre-defined datapath with a ‘flat’ organization of 
FUs [12, 16]. An exception is [7]. In this case, a lower bound on latency is 
computed for a DFG bound to a specific clustered VLIW datapath. By contrast, 
the objective of our algorithm is to estimate the minimum latency that can be 
achieved for a given DFG over a family of clustered VLIW datapaths defined 
by the specified capacity constraints. Thus, once again the problems are quite 
distinct. 

6. Conclusions 

We proposed an algorithm to support trade-off exploration during the early 
phases of the design of VLIW ASIPs with clustered datapaths. Encouraging ex- 
perimental results obtained for a number of benchmark kernels, assuming var- 
ious cluster capacities, show that our aggressive heuristic decomposition and 
pruning strategies work quite well in practice. We are currently working on 
incorporating high-level memory system design issues in our design space ex- 
ploration framework - these will be considered prior to the exploration step 
discussed in this paper. 

Notes 

1. ALAP(o) denotes the as-late-as-possible step of o for a given TL. 

2. We also remove edges that traverse a number of scheduling steps exceeding the mobility of the 
activities that were extracted. 

3. This measure of load accounts for the fact that activities with higher mobility have more flexibility in 
their scheduling ranges, and thus should have lower importance in terms of assessing compatibility among 
layers. 

4. A covering is a collection of sets such that the union includes all the operations in the sub-IDFG and 
a partition is a cover of disjoint sets. 

5. The time frame of an operation o is given by [ASAP(o), ALAP(o)]. 

6. An unscheduled operation is said to be ready at step s if s is in its time frame. 

7. Due to the high complexity of the optimization problem being tackled, verifying the optimality of a 
solution with respect to latency is virtually impossible. However, in all cases we have determined by inspection 
that for the given capacity constraints the latency penalty steps could not be reduced. 

8. Cluster area estimation is beyond the scope of this paper. 

9. Delay and power estimation are beyond the scope of this paper. 

1. Introduction 

The dream of generating efficient logic implementations from higher level 
specifications originated with the works of Boole [57], Shannon [58], Quine and 
McCluskey [35, 36]. Interest in logic synthesis grew during the 60’s and 70’s, as 
the computers being designed became more complex. Although many theoreti- 
cal advances were made, the first examples of practical synthesis did not occur 
until the late 70’s. Programmable logic arrays (PLAs) were minimized with 
the program MINI [15] and used on many product chips in IBM. LSS [10, 9] 
was the first example of production synthesis of gate array chips. It was based 
on local transformations and compiler techniques for optimizing logic, mapping 
gates to a specific technology. It was used on hundreds of product chips in IBM 
and was rewritten later with many significant refinements as BooleDozer [11]. 
These optimization methods were also used in the first offerings from Synop- 
sys[12, 13], a company formed in 1986 (originally Optimal Solutions) to mar- 
ket synthesis technology developed at General Electric. Synopsys succeeded 
in bringing synthesis to the commercial market enabling a dramatic advance in 
design productivity. The years that followed saw many important developments 
that produced improvements in execution speed, quality of results and ability to 
deal with real technologies. Today, logic synthesis is a critical part of almost all 
chip development projects. 

The sub-areas of logic synthesis are manifold, motivated by the increasing
challenges of technology advancements. These include technology-independent
methods, including combinational and sequential synthesis [45, 48]; two-level,
multi-level and algebraic methods [2, 40]; redundancy and related ATPG methods
[46]; technology mapping for area, delay, power [44], testability and FPGAs;
delay analysis [28, 33, 34], performance optimization [30, 55], and delay
testing [53, 54]; incremental synthesis; don’t cares and other flexibilities
[42, 39, 47, 41]; asynchronous synthesis; and multi-valued synthesis [38, 43].
A key
improvement in synthesis technology was the development of the ROBDD [27] 
and the ever improving BDD packages[32]. These have become the modern-day 
truth table, and have made some of the initial theoretical advances in the ’60s and 
early ’70s quite practical. A similar development seems to be happening with 
the development of more efficient SAT solvers [29, 31]. Both BDD and SAT 
solving have had a major impact on the way logic expressions are represented 
and analyzed, which is at the very basis of efficient logic synthesis methods. 

Many of the advances in logic synthesis appeared in their initial form at 
ICCAD. Six of these papers have been selected for inclusion in this volume, 
based on the combined originality and excellence* of their main ideas. The se- 
lection was difficult, since there were many qualified papers, some of which are 
mentioned below. In most cases, significant advances occurred incrementally 
over the years, producing ever improving results, often finally displacing the 
original developments. The advance of technology and the companion growth 
in system complexity continue to demand further innovations in logic synthe- 
sis to provide the productivity improvements needed for designers to take full 
advantage of technology’s promise. 

We discuss the papers selected in chronological order and try to give them 
an historical perspective. All the papers have in common the fact that they in- 
troduced concepts that have been adopted and used in commercial synthesis 
programs. 

2. Selected Papers 
2.1 Espresso 

The first paper [23] is from ICCAD 1986 and is one of several papers on the 
ESPRESSO PLA minimization programs [1]. These programs are still used 
extensively both commercially and academically. This paper represents a sig- 
nificant extension of ESPRESSO to both exact minimization and multi-valued 
logic [22, 24]. ESPRESSO, first developed in the summer of 1982 at IBM 
[37], went through several iterations before becoming the version used today. 
Frequently, it is used internally in many multi-level logic synthesis programs 
as a subroutine. ESPRESSO provides both exact and heuristic two-level min- 
imization. The extension described in the selected paper led to a wider set of 
applications (due to the multi-valued extensions) and improved run-time (due 
to improved data structures). The exact minimization procedure follows Quine- 
McCluskey, generating primes and solving a unate covering problem to obtain 
a minimum sum-of-products expression. Important innovations in each of these 
areas resulted in faster execution and the ability to be used on larger examples. 
During its development, ESPRESSO accumulated over 130 benchmarks for testing.
It was surprising at the time that ESPRESSO-EXACT could solve most of them in
reasonable time, with only a handful that had too many primes to be solved.
Interestingly, several years later, these hard test cases motivated several new
research efforts [8, 21], using BDDs and new methods for generating useful
primes, which ultimately solved the complete benchmark suite exactly.
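For readers unfamiliar with the Quine-McCluskey flow mentioned above, the following Python sketch shows its two phases on a single-output function given by its minterms. It is a textbook illustration only, not ESPRESSO-EXACT: the function names are hypothetical, don't-cares are not handled, and the covering step is a brute-force search rather than the data structures and reductions used in the actual program.

```python
from itertools import combinations


def prime_implicants(minterms):
    """Return primes as (value, mask) pairs; set mask bits mark eliminated
    variables, and value holds the remaining literals."""
    current, primes = {(m, 0) for m in minterms}, set()
    while current:
        merged, nxt = set(), set()
        for a, b in combinations(current, 2):
            diff = a[0] ^ b[0]
            # Mergeable: same eliminated variables, values differ in one bit.
            if a[1] == b[1] and diff and (diff & (diff - 1)) == 0:
                nxt.add((a[0] & ~diff, a[1] | diff))
                merged.update((a, b))
        primes |= current - merged  # cubes that could not grow are prime
        current = nxt
    return primes


def minimum_cover(minterms):
    """Smallest set of primes covering all minterms (exhaustive search)."""
    primes = sorted(prime_implicants(minterms))

    def covers(p, m):
        return (m & ~p[1]) == p[0]

    for size in range(1, len(primes) + 1):
        for subset in combinations(primes, size):
            if all(any(covers(p, m) for p in subset) for m in minterms):
                return list(subset)
    return []


# Three-variable example with a cyclic prime structure.
cyclic = [0, 1, 2, 5, 6, 7]
primes = prime_implicants(cyclic)
cover = minimum_cover(cyclic)
```

The exhaustive covering search is exponential in the number of primes, which is why benchmarks with very many primes were the hard cases noted in the text.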

2.2 MIS 

The paper “Multiple-Level Logic Optimization System”, MIS, is also from 
the 1986 ICCAD [4], and describes a program development that grew out of 
the Yorktown Logic Editor, developed at IBM Yorktown Heights in the early 
80’s. MIS was the first program that used algebraic methods [2] for fast multi- 
level logic decomposition. It was based entirely on algorithmic methods rather 
than local transformations. The paper outlines the overall flow and the basic 
algorithms for extraction, re-substitution, factoring, decomposition and simpli- 
fication. Several commercial logic synthesis programs were based on improved 
versions of MIS [5] and a follow-on program SIS [25, 3] with extensions to se- 
quential synthesis. Over the years, MIS/SIS, which is available in source code 
form, has served as a platform for logic synthesis development and for compar- 
ing various algorithms. 

2.3 Global Flow 

LSS’s initial success at IBM was based on extensive use of local transforma- 
tions. Over time these suites of transforms were augmented or in some cases 
replaced by more systematic and comprehensive algorithms. “Improved Logic 
Optimization Using Global Flow Analysis” [6] is an excellent example of this 
progress. The paper describes improvements in the original global flow meth- 
ods [51]. These methods are related to later works on don’t cares [49, 52], 
redundancy addition and removal [46, 50], and recursive learning [47]. Here 
data-flow methods are employed to compute a “controlling” relation between 
signal pairs and then this summary data is used to enable a diverse set of op- 
timizations. This paper describes significant improvements resulting in better 
logic, quicker runtime, an improved ability to handle don’t cares, and a method 
of reducing connections, often a critical issue in real chips. 

2.4 Concurrent Optimization 

“A Method for Concurrent Decomposition and Factorization of Boolean Ex- 
pressions” (ICCAD 1990) [26] introduced a new algebraic technique. It focused 
only on “two-cube divisors” and provided an implementation where both single 
cube and two-cube divisors could be simultaneously evaluated for choosing the 
best one. This allowed for selecting a sequence of division steps which 
interleaved the use of single and multiple cube divisors. In addition, some divisors 
could be evaluated in combination with their complements, when the 
complement was a single or two cube expression, further improving the result. The 
paper also showed how to incrementally update the data structure after a divi- 
sor was chosen, resulting in a very fast implementation. One advantage of this 
approach is that there can be no more than a quadratic (in the number of cubes 
in the expression) number of such divisors, whereas in the use of kernels, an 
exponential number is possible. These methods were shown to be competitive 
in quality with the use of kernels, but are much faster in many cases. They 
have been implemented in MIS and SIS and are the basis for their extraction 
procedures, which find common algebraic divisors. 
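
The quadratic bound on two-cube divisors can be seen from a minimal sketch, which is not the implementation described in the paper. It assumes each cube is a set of literal names and forms one candidate divisor per pair of cubes by stripping the literals the pair shares; the function name `double_cube_divisors` and the example expression are illustrative choices.

```python
from itertools import combinations


def double_cube_divisors(cubes):
    """Return the set of two-cube divisors of a sum-of-products expression.

    Each cube is a frozenset of literal names. A divisor is obtained from a
    pair of cubes by removing the literals the two cubes have in common.
    """
    divisors = set()
    for c1, c2 in combinations(cubes, 2):
        common = c1 & c2
        d1, d2 = c1 - common, c2 - common
        if d1 and d2:  # skip pairs where one cube contains the other
            divisors.add(frozenset({d1, d2}))
    return divisors


# f = ac + ad + bc + bd + e
f = [frozenset(c) for c in ("ac", "ad", "bc", "bd", "e")]
divs = double_cube_divisors(f)
```

With n cubes there are n(n-1)/2 pairs, so the number of distinct divisors can never exceed that figure; here the five cubes yield eight divisors, among them a + b and c + d.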

2.5 FPGA Synthesis 

“An Optimal Technology Mapping Algorithm for Delay Optimization in Look- 
up-Table Based FPGA Designs” (ICCAD 1992) [7] represents an important con- 
tribution in the area of FPGA synthesis, which will likely have increasing rele- 
vance in the future. The work shows how to map logic to an FPGA where the 
depth of the network (and hence delay) can be minimized exactly, not just for 
trees, but for more general DAGs, in polynomial time. In contrast, technology 
mapping (see Note 2) for area for DAGs is NP-hard [17]. This paper, as well as the next one, 
inspired a later work [19], which showed, surprisingly, that technology mapping 
for optimum delay for DAGs, like tree mapping, can be done in linear time 
(assuming that the delay is independent of the fanout load). 

2.6 Technology Mapping 

“Logic Decomposition During Technology Mapping” (ICCAD 1995) [20] 
describes a method for combining technology mapping and algebraic decompo- 
sition. The first breakthrough in technology mapping originated in the classical 
paper by Keutzer at Bell Labs [18], where the problem was shown to be similar 
to compiler optimization [14] and could be decomposed into tree mapping, 
providing a linear-time algorithm. For many years this has been the method of 
choice, with later developments focusing on optimization for delay and power, 
as well as area. However, the step before technology mapping required that 
the logic be decomposed into generic basis functions such as 2-input NAND 
gates. This biased the final result because it required a priori decisions on how 
to decompose the logic functions algebraically. The paper by Lehman et al. 
solved this by defining a new structure, a mapping graph, which can represent 
all possible algebraic decompositions. It was made efficient by identifying sub- 
structures that represented the same logic, similar to BDDs. The final map- 
ping used dynamic programming, visiting each node and finding a best match, 
similar to what is done in tree mapping. Later developments [16] improved on 
area-delay tradeoffs and provided more details for an efficient implementation. 
The method was used in minimizing delay in DEC Alpha machines. A detailed 
version of the work appeared in IEEE Trans. CAD 1997 [56] and received the 
best paper award for that year. 

3. Additional ICCAD Contributions 

The last twenty years of ICCAD included many important papers beyond 
the six included in this book. A select few of these are listed in chronological 
order along with a brief comment on each. The set represents solutions to a 
broad range of different problems arising in logic synthesis; two present elegant 
solutions for sequential optimization; however, such methods are rarely, if ever, 
used in practice, because of their impact on verification and other parts of a 
design methodology. 



■ KISS: A program for Optimal State Assignment for FSMs, G. De Micheli, 
R. Brayton, and A. Sangiovanni-Vincentelli 1984. This paper described a 
state assignment method for two-level implementation of finite state ma- 
chines using multi-valued optimization and embedding of face constraints 
generated by the optimization result. 



■ Performance-Oriented Synthesis in the Yorktown Silicon Compiler, G. De 
Micheli 1986. This paper outlines a method for the overall optimiza- 
tion of delay in a combinational circuit by a sequence of re-synthesis, re- 
positioning and re-sizing steps as implemented in the YSC. This was prob- 
ably the first attempt at combining logic synthesis and physical design op- 
timizations, a technique essential for today’s technologies and present in 
most commercial synthesis systems. 



■ Multi-level Logic Optimization and the Rectangle Covering Problem, R. 
Brayton, Rick Rudell, A. Sangiovanni-Vincentelli, and A. Wang 1987. 
This paper gave an elegant solution for extracting best kernels from a list 
of all kernels generated from a set of logic functions found in a Boolean 
network. The solution is stated in terms of finding rectangles in an inci- 
dence matrix of cubes versus kernels. 



■ Observability Relations and Observability Don’t Cares, R. Brayton and 
H. Savoj 1991. This paper extended the theory of observability don’t 
cares to multi-output nodes in a Boolean network and gave a method for 
computing the maximum flexibility for such nodes in terms of Boolean 
relations. 

■ Calculating Resetability and Reset Sequences, Carl Pixley 1991. This paper 
solved the problem of computing a reset sequence for a finite state 
machine after power up, when the machine could be in any state. It gives 
the condition when a machine is resettable and if it is, a method for com- 
puting a reset sequence. 

■ Analysis of Cyclic Combinational Circuits, Sharad Malik 1993. This paper 
re-examines the property of a circuit being combinational and analyzes 
cyclic circuits for this property using a method based on ternary 
simulation. 

■ Switching Activity Analysis Considering Spatiotemporal Correlations, R. 
Marculescu, D. Marculescu, and M. Pedram 1994. This paper looks at 
the problem of estimating power dissipation in a circuit and provides a 
method of computing switching activity using Markov chains to model 
both temporal correlations in a single signal and correlations between 
pairs of signals. 

■ Logic Optimization by Output Phase Assignment in Dynamic Logic Syn- 
thesis, R. Puri, A. Bjorksten, and T. Rosser 1996. This paper addresses 
the logic duplication problem encountered in dynamic logic. It solves, 
exactly and heuristically, the problem of minimizing the duplication re- 
quired by assigning output phases to the logic nodes. This method has 
been used by many companies for dynamic logic design. 

■ A New Method to Express Functional Permissibilities for LUT based FP- 
GAs and Its Applications, S. Yamashita, H. Sawada, A. Nagoya 1996. 
This paper describes a new method, originally applied to LUT-type FP- 
GAs, for computing flexibility in circuits. This flexibility, called SPFDs, 
is fundamentally different from don’t cares and potentially provides more 
flexibility in implementing circuits. 

Notes 

1. Like Samuel Johnson’s “Your manuscript is both good and original, but the part that is good is not 
original, and the part that is original is not good”, these qualities were required to be non-disjoint. 

2. Technology mapping is the problem of mapping a net-list of logic functions into a given library of 
specific gates. 

Abstract 

MIS is a multi-level logic synthesis and minimization system and is an integral part of the 
Berkeley Synthesis Project. MIS starts from a description of a combinational logic macro-cell and 
produces an optimized set of logic equations which preserves the input-output behavior of the 
macro-cell. The system includes algorithms for minimizing the area required to implement the 
logic equations, and a global timing optimization step which is used to change the form of the 
logic equations along the critical path in order to meet system-level timing constraints. This pa- 
per provides an overview of the optimization system including the input language, the algorithms 
which minimize the area of the implementation, and the algorithms used to re-structure the logic 
network to meet the system-level timing constraints. Although the system is still under development, 
pieces of an industrially designed chip have been re-synthesized with MIS and the results 
compare favorably with the manual designs. 



1. Introduction 

MIS is a multi-level logic synthesis and minimization system and is an integral 
part of the Berkeley Synthesis Project. The program bdsyn starts from a 
high-level description of a combinational logic macro-cell, and extracts a set of 
logic equations. MIS optimizes the logic equations and produces a logic network 
which preserves the input-output behavior of the macro-cell. Communication 
between bdsyn and MIS, and between MIS and the module generation tools is 
through the OCT database [6]. Module generators which synthesize a symbolic 
layout for a macro-cell directly from an optimized logic network are also part of 
the Berkeley Synthesis Project. 

The optimization criterion is to minimize the area occupied by the logic equa- 
tions (which is measured as a function of the number of gates, transistors, and 
nets in the final set of equations) while simultaneously satisfying the timing con- 
straints derived from a system-level analysis of the chip. Timing constraints are 
also passed to the module generator tools to guide the placement and routing 
of the gates within a macro-cell, and similar constraints are passed to the floor- 
planning and placement and routing tools to guide the placement and routing of 
the macro-cells. 

In this paper, we briefly present the major components of the multiple-level 
logic optimization system. These include the input language, the global opti- 
mization strategy for area minimization under timing constraints, and the local 
optimization step for mapping the resulting set of logic equations into an imple- 
mentation. The system first minimizes the area without concern for the delay. 
Then MIS enters a timing pre-optimization step where the critical paths of the 
entire system are derived and the logic equations are re-structured along a cut-set 
of the critical paths to trade off area for speed to meet the system timing requirements. 
This re-structuring consists of collapsing logic functions to fewer levels 
and duplicating logic functions. Final timing optimization is done by sizing 
transistors, and by passing constraints to the physical design tools to influence 
the placement and routing of the critical paths. 

The system and the algorithms are general and hence largely independent of 
technology. Currently, MIS is targeted for complex-gate static CMOS designs. 
However, we have also considered the problems of fixed-gate technologies (i.e., 
fixed library cells) and dynamic CMOS design, and are confident that the same 
algorithms can be used for these other design styles. 

2. Specification of the Design 

MIS starts from a net-list description of a combinational logic network, where 
each block (i.e., gate) in the network is a multiple-input, single-output Boolean 
function. A network of this form is called a Boolean Network. A BLIF (Berkeley 
Logic Interchange Format) file is a textual representation of a Boolean network 
and can be used to communicate with tools outside the Oct environment. 

As part of the Berkeley Synthesis Project, we have developed a program 
which translates a high-level description of a piece of logic into a Boolean Net- 
work. The designer partitions a digital design into combinational blocks and the 
latches (or bus transceivers) which connect the logic blocks. The combinational 
blocks are described by programs written in bdsyn, which is a subset of the 
hardware description language BDS. (BDS is the behavioral language used as 
part of Digital Equipment Corporation’s mixed-mode simulation system.) The 
latches and buses are connected to the logic blocks in a net-list. This net-list is 
the starting point for the floor-planning and placement and routing tools in the 
Berkeley Synthesis Project. 

BDS is a programming language with the built-in data type of a bit-vector and 
the basic operations on bit-vectors (i.e., bit selection, logical operations, shift 
and rotate operations, and arithmetic operations). A program written in bdsyn 
has a well-defined set of inputs and outputs, and is assigned the semantics that 
the output is computed as a combinational function of the inputs. An arbitrary 
BDS description is allowed (including functions, loops, global variables, and 
multiple-assignment to variables). The program bdsyn extracts from this de- 
scription a Boolean Network (i.e., a set of logic equations) with input/output 
behavior equivalent to the bdsyn description. It is important to note that these 
equations consist of many levels of logic thus removing any limitation on the 
type of logic described. We are currently investigating extensions to bdsyn 
which will allow the designer to explicitly specify don’t-care conditions in the 
logic network. Our experience shows that allowing the designer to specify 
this information to the system can dramatically improve the optimization of the 
logic. 

Figure 1 shows the BDSYN description of the register file decoder from the 
SPUR microprocessor [5]. The register file decoder is complicated by the use of 
the overlapping window scheme [7]. cwp is the current window pointer for the 
register file, and reg is the index of the register within the window. A one-hot 
signal addr selects one of the 138 registers in the register file. The operator & is 
bit concatenation, the operator ZXT performs an extension of an operand with 
zero fill, and the operator SLO performs a shift-left with zero fill. Note the use of 
complex operations such as addition, comparison, and variable shift left (all of 
which are supported in bdsyn). Also, the variable address is assigned to more 
than once. 
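
The behavior described for the register file decoder can be mirrored by a small software model. This is a sketch under stated assumptions rather than anything produced by bdsyn: it reads the concatenation of cwp with four zero bits as cwp shifted left by four, treats the one-hot output as a Python integer, and the function name `decode` is hypothetical.

```python
NUMREGS = 138
NUMGLOBALS = 10


def decode(cwp: int, reg: int) -> int:
    """Return the 138-bit one-hot register select as a Python integer."""
    if not (0 <= cwp < 8 and 0 <= reg < 32):
        raise ValueError("cwp is 3 bits wide and reg is 5 bits wide")
    if reg <= 9:
        address = reg                    # reference to a global register
    else:
        address = (cwp << 4) + reg       # assumed meaning of the concatenation
        if address > NUMREGS - 1:        # overflow past the last window
            address -= NUMREGS - NUMGLOBALS
    return 1 << address                  # one-hot decode (shift left, zero fill)
```

Under these assumptions every one of the 138 registers is reachable from some (cwp, reg) pair, and the highest window wraps around onto the registers just above the globals.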

3. MIS 

MIS is set up as an interactive logic synthesis system to allow us to experiment 
with different algorithms. MIS can also be run in a batch mode, as an automatic 
synthesis system would require, by a script file of MIS commands. The list of 
MIS commands is quite extensive and continues to grow as more development 
and experimentation is done. Currently, commands in MIS are concerned with 
eliminating nodes, simplifying nodes, substituting one node into another, fac- 
toring nodes, breaking nodes into smaller components, finding common kernels 
and common cubes, reading and writing logic files, factoring and complement- 
ing the functions, and displaying information about the Boolean Network. Many 
of these operations are broken down into different versions with varying efficiency. 
A typical MIS script file that may be used for automatic synthesis is 

MODEL decoder addr<137:0> = cwp<2:0>, reg<4:0>; 
BEHAVIOR; 
  CONSTANT NUMREGS = 138, NUMGLOBALS = 10; 
  ROUTINE main; 
    STATE address<7:0>; 
    ! Check for reference to global register 
    if reg leq 9 then 
      address = reg 
    else begin 
      ! compute address, check overflow 
      address = (cwp & 0000#2) + reg; 
      if address gtr (NUMREGS - 1) then 
        address = address - (NUMREGS - NUMGLOBALS); 
    end; 
    ! Create the one-hot decode based on the address 
    addr = (ZXT {width=138} 1) SLO address; 
  ENDROUTINE main; 
ENDBEHAVIOR; 
ENDMODEL decoder; 



Figure 1. BDSYN Example. 



shown in Figure 2. A partial list of MIS commands and a short explanation of 
each is shown in Figure 3. Some of these operations are explained further below. 

By changing the script file and changing some of the parameters of the op- 
erators, the file can be tailored to the kind of logic at hand. However, we need 
more experience to determine how to do this automatically. 

3.1 Global Area Optimization 

The goal of this step is to reduce the complexity of the logic equations us- 
ing global techniques which allow significant re-structuring of the network, and 
which are independent of the particular design style or technology. Each node in 
the Boolean network is a completely specified Boolean function, and is stored 
in sum-of-products form. We use as a measure of the complexity of a logic 
equation the number of literals required to represent the function in a factored 
form. The cost for a Boolean network is the sum of the costs for each node. For 
example, the function 

$$f_1 = abe\bar{g} + abfg + ab\bar{e}g + ace\bar{g} + acfg + ac\bar{e}g + de\bar{g} + dfg + d\bar{e}g + bh + bi + ch + ci$$ 

can be written in factored form as 

$$f_1 = (a(b + c) + d)(e\bar{g} + g(f + \bar{e})) + (b + c)(h + i).$$ 

As will be mentioned later, there are several different algorithms which can 
be used to factor a single logic equation, each with a different performance and 
quality tradeoff. Because the cost function must be evaluated frequently during 
the synthesis, we favor a fast factoring algorithm during the early stages of the 
synthesis. 

To explain the reasoning behind the cost function, consider that $f_1$ may be implemented 
as a single CMOS complex-gate using 13 transistor pairs. Or, because 
of various technology reasons, the gate might require further decomposition into 
smaller gates. However, when this function is broken into smaller gates, it can 
be broken down in a manner corresponding to its factored form; hence, the tran- 
sistor count does not increase greatly. The total number of literals in the factored 
form is a good measure of the number of transistors in the network. 
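
The cost measure can be made concrete with a short sketch that counts literal occurrences in an expression string. The helper name `literal_count` is hypothetical, and the placement of the complement marks (written here as apostrophes) is an assumption; the count itself does not depend on where the complements fall.

```python
def literal_count(expr: str) -> int:
    """Count literal occurrences; an apostrophe after a letter marks a complement."""
    return sum(ch.isalpha() for ch in expr)


# Sum-of-products form and factored form of the same example function.
sop = ("abeg' + abfg + abe'g + aceg' + acfg + ace'g + "
       "deg' + dfg + de'g + bh + bi + ch + ci")
factored = "(a(b + c) + d)(eg' + g(f + e')) + (b + c)(h + i)"

sop_cost = literal_count(sop)            # literals in sum-of-products form
factored_cost = literal_count(factored)  # literals in factored form
```

The factored form needs 13 literals, matching the 13 transistor pairs quoted above, against 41 literals for the sum-of-products form.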

Now assume that $f_1$ was identified as a common factor to several other functions, 
and has been created to reduce the total complexity of the network. Regardless 
of how $f_1$ is implemented (either as a single gate, or requiring further 
decomposition), it has reduced the number of transistors in the network by an 
amount corresponding to the total number of transistors that have been saved in 
the functions which it feeds minus the number of transistors in $f_1$. 

3.1.1 Identifying Global Commonality. The most important 
(and most difficult) step in global minimization is to identify factors common to 
two or more functions which can be used to reduce the total number of literals in 
the network. To accomplish this, we use two basic operations: generating common 
algebraic factors from a set of logic functions (extraction), and checking whether 
an existing function is a factor of some other function (resubstitution). 

3.1.2 Extraction. Based on the notion of a kernel [2], the set of all 
useful common algebraic divisors of a set of equations consists of all kernels and 
their intersections, and common single cubes. We thus limit our search space to 
first finding common kernels and common intersections of kernels, and then 
common single cube divisors. We can demonstrate that these two operations are 
computationally equivalent and can be formulated as the problem of finding a 
minimum rectangle cover of a 0-1 matrix. 

For example, let us consider the problem of finding common single cube divisors. 
Suppose we have a number of functions $f_j$ which are each described 
as a sum of products. Then we can construct a 0-1 matrix, $B$, such that row $i$ 
represents the $i$-th product, and column $j$ represents the $j$-th literal. A 1 is inserted 
in element $(i, j)$ if literal $j$ is in product $i$. A rectangle is defined as a subset of 
rows $R$ and columns $C$ such that for all $(i, j)$ where $i \in R$ and $j \in C$ we have 
$B_{ij} = 1$. Then a rectangle represents a common subproduct which can be found 
in each of the products associated with the subset of rows $R$. An optimal rectangle 
covering of $B$ is then an optimal set of subproducts which can be extracted 
and implemented as separate functions, and each of these used to simplify the 
functions $f_j$. 
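
The matrix formulation can be illustrated with a small sketch. It builds the 0-1 matrix for a list of products and then searches pairs of rows for the rectangle covering the most 1-entries. This greedy pairwise search is only an illustration and not the rectangle-covering algorithm itself; the function names and the example products are assumptions.

```python
from itertools import combinations


def cube_literal_matrix(products):
    """Build the 0-1 matrix B: rows are products, columns are literals."""
    literals = sorted(set().union(*products))
    matrix = [[1 if lit in p else 0 for lit in literals] for p in products]
    return literals, matrix


def best_common_cube(products):
    """Return (area, rows, literals) of the largest rectangle found.

    For each pair of rows, the columns set to 1 in both rows are taken, and
    the row set is grown to every row that has a 1 in all of those columns.
    """
    literals, B = cube_literal_matrix(products)
    best = (0, (), ())
    for r1, r2 in combinations(range(len(B)), 2):
        cols = tuple(j for j in range(len(literals)) if B[r1][j] and B[r2][j])
        if not cols:
            continue
        rows = tuple(i for i in range(len(B)) if all(B[i][j] for j in cols))
        area = len(rows) * len(cols)
        if area > best[0]:
            best = (area, rows, tuple(literals[j] for j in cols))
    return best


# Products abc, abd, abe and cd: the cube ab is common to the first three.
products = [set("abc"), set("abd"), set("abe"), set("cd")]
result = best_common_cube(products)
```

Here the rectangle with rows {0, 1, 2} and columns {a, b} corresponds to extracting the single cube ab from the first three products.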

This covering problem has analogies to the problem of Boolean minimiza- 
tion. For example, we can define prime rectangles, and can solve the problem 
exactly using an algorithm very similar to the Quine-McCluskey algorithm for 
Boolean minimization. Because of the computational complexity of this ex- 
act solution to this problem, we are investigating adapting iterative, heuristic 
Boolean minimization algorithms (such as ESPRESSO-II [4]) to this problem. 

3.1.3 Resubstitution. Once the factors have been determined, al- 
gebraic resubstitution is used to simplify the network. For example, suppose the 
network is: 



$$f_1 = ac + ad + bc + bd + e$$ 
$$f_2 = a + b.$$ 

Function $f_1$ can be rewritten as: 

$$f_1 = f_2(c + d) + e.$$ 

This operation is called algebraic resubstitution since $a + b$ is an algebraic divisor 
of $ac + ad + bc + bd + e = (a + b)(c + d) + e$. 
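
Algebraic resubstitution rests on algebraic (weak) division. The following is a minimal sketch of the textbook procedure, not the sorted, faster variant used in practice; the names `weak_divide` and `sop` are illustrative, and cubes are modeled as sets of single-letter literals.

```python
def weak_divide(f, g):
    """Algebraic (weak) division of f by g.

    f and g are sets of cubes, each cube a frozenset of literals.
    Returns (quotient, remainder) such that f = g * quotient + remainder.
    """
    quotient = None
    for d in g:
        # Cubes of f that contain d, with the literals of d removed.
        partial = {c - d for c in f if d <= c}
        quotient = partial if quotient is None else quotient & partial
    quotient = quotient or set()
    product = {q | d for q in quotient for d in g}
    remainder = set(f) - product
    return quotient, remainder


def sop(text):
    """Parse 'ac + ad + e' into a set of cubes (single-letter literals)."""
    return {frozenset(term) for term in text.replace(" ", "").split("+")}


f1 = sop("ac + ad + bc + bd + e")
f2 = sop("a + b")
q, r = weak_divide(f1, f2)
```

Dividing ac + ad + bc + bd + e by a + b gives the quotient c + d and the remainder e, which is the rewriting (a + b)(c + d) + e.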

However, algebraic techniques alone do not exploit all of the Boolean prop- 
erties of the logic equations. To improve the results, we also perform Boolean 
resubstitution, which is to check if any of the existing functions is a divisor of 
other functions in the network in the Boolean sense. Boolean resubstitution is 
capable of providing better results, but in general takes much longer than alge- 
braic resubstitution. For example, algebraic resubstitution does not simplify the 
following network: 

$$f_1 = (ab + cd)\bar{e}\bar{f} + (ab + ef)\bar{c}\bar{d} + (cd + ef)\bar{a}\bar{b}$$ 
$$f_2 = ab + cd + ef.$$ 

However, using Boolean resubstitution, $f_1$ can be rewritten as 

$$f_1 = f_2(\bar{a}\bar{b} + \bar{c}\bar{d} + \bar{e}\bar{f})$$ 

for a savings of 11 literals. 
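
The Boolean rewriting can be checked by exhaustive simulation. The sketch below assumes that the complement marks fall on the individual literals, so that the first function is (ab + cd)e'f' + (ab + ef)c'd' + (cd + ef)a'b' and the second is ab + cd + ef; that reading, and the Python function names, are assumptions.

```python
from itertools import product


def f2(a, b, c, d, e, f):
    return (a and b) or (c and d) or (e and f)


def f1_original(a, b, c, d, e, f):
    na, nb, nc, nd, ne, nf = (not v for v in (a, b, c, d, e, f))
    return ((((a and b) or (c and d)) and ne and nf)
            or (((a and b) or (e and f)) and nc and nd)
            or (((c and d) or (e and f)) and na and nb))


def f1_resubstituted(a, b, c, d, e, f):
    na, nb, nc, nd, ne, nf = (not v for v in (a, b, c, d, e, f))
    return f2(a, b, c, d, e, f) and ((na and nb) or (nc and nd) or (ne and nf))


# Compare the two forms on all 64 input assignments.
equivalent = all(
    bool(f1_original(*bits)) == bool(f1_resubstituted(*bits))
    for bits in product([False, True], repeat=6)
)
```

The equality holds only because products such as ab and a'b' vanish, which is a Boolean identity; algebraic division cannot use it. Counting literals, the original form has 18 and the rewritten form has 7 (one for the second function plus six), which gives the saving of 11.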

It is known that algebraic resubstitution can be performed in time 
$O((|f| + |g|)\log(|f| + |g|))$ when the functions $f$ and $g$ have $|f|$ and $|g|$ product terms, 
respectively [2]. However, Boolean resubstitution involves a Boolean minimization 
step with a proper don’t-care set. Hence, heuristics are needed to decide 
when to use Boolean resubstitution in order to reduce the execution time. 

3.1.4 Phase Assignment. Each function is represented as a positive 
logic, factored expression. Also, each intermediate variable is assumed 
present in both its true and complemented form. In practice, logic gates in 
static CMOS are inverting, and not all intermediate variables are needed in both 
phases. Hence, there is an optimization problem of choosing the phase of each 
intermediate function to minimize the number of inverters needed to implement 
the network. Global phase assignment determines, for each 
function, whether to implement the function or its complement. We are experimenting 
with several heuristic solutions to this problem. 

3.2 Local Optimization 

Local optimizations refer to operations performed on a single logic function, 
or locally in the surrounding neighborhood of the logic function. Our synthesis 
system uses the local transformations of: decomposing large gates into smaller 
ones, deriving better implementations of the gates, and simplifying each gate 
using knowledge of its local environment. 

3.2.1 Local Decomposition of a Gate. We use two different 
algorithms to decompose a gate into a set of smaller gates. Good alge- 
braic decomposition uses a greedy strategy based on selecting the best kernel 
and pulling out this kernel. Quick decomposition is done by generating only a 
single level-0 kernel at each step. Quick decomposition is computationally less 
expensive than generating all kernels and their intersections (in some cases, it is 
as much as 5 times faster) and has been shown to produce good results (often 
identical results). We will illustrate decomposition by referring to the examples 
given below for factoring, since decomposition and factoring are very similar 
operations. 

3.2.2 Factoring a Gate. We are currently targeting a complex- 
gate CMOS design style. Hence, another local optimization is to factor the logic 
equation of a single gate in order to produce an optimal pull-down (and pull- 
up network) for the gate. Different factoring algorithms have been explored. 
Each has its own use and run time cost. Quick factoring (a modification of 
Quick Decomposition) is useful to estimate the cost of an implementation. The 
number of literals in a quickly factored form gives a good estimate of the 
number of transistors in a complex-gate CMOS implementation, and the cost function 
is evaluated frequently during the synthesis. Two other factoring algorithms, good 
algebraic factoring and Boolean factoring, are also implemented. Even 
though they are relatively expensive operations, they occasionally yield better 
factorizations, and are used to derive the final implementations of the gates. 

To illustrate the different results that can be obtained by these various factoring 
algorithms, consider the function 

$$f_1 = ac + ad + ae + ag + bc + bd + be + bf + ce + cf + df + dg.$$ 

Quick factoring (QF) and good factoring (GF) give different but similar quality 
factorings, 

$$GF(f_1) = (c + d + e)(a + b) + f(b + c + d) + g(a + d) + ce$$ 
$$QF(f_1) = g(a + d) + (a + b)(c + d + e) + c(e + f) + f(b + d)$$ 

but QF is much faster because it only needs to determine one level 0 kernel for 
each factor. However, for 

$$f_1 = abe\bar{g} + abfg + ab\bar{e}g + ace\bar{g} + acfg + ac\bar{e}g + de\bar{g} + dfg + d\bar{e}g + bh + bi + ch + ci$$ 

the factored results are: 

$$QF(f_1) = (a(g(\bar{e} + f) + e\bar{g}) + i + h)(b + c) + d(g(\bar{e} + f) + e\bar{g})$$ 
$$GF(f_1) = (a(b + c) + d)(e\bar{g} + g(f + \bar{e})) + (b + c)(h + i).$$ 

In this example, GF obtains the better result (13 literals versus 16 literals) by 
working harder to find the best kernel at the top level of the recursion. To illus- 
trate Boolean factoring (BF), consider 

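$$f_1 = a\bar{b} + a\bar{c} + a\bar{d} + \bar{a}b + b\bar{c} + b\bar{d} + \bar{a}c + \bar{b}c + c\bar{d} + \bar{a}d + \bar{b}d + \bar{c}d.$$ 

We obtain 

$$BF(f_1) = (a + b + c + d)(\bar{a} + \bar{b} + \bar{c} + \bar{d}),$$ 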

whereas 

QF[fi) = GF{fi) = a[b-{-c -bd) 4- (a-l- l>)(c-l-d) -H c(&4-d) -I- cd. 

QF has become an important tool since it is so fast and effective. QF is used 
continuously during synthesis to estimate area and delay. 
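
That two different factored forms represent the same function can be checked by multiplying them out. The sketch below uses a small hypothetical `SOP` class whose `+` and `*` operators implement algebraic sum and product on sets of cubes, and applies it to the twelve-term function ac + ad + ae + ag + bc + bd + be + bf + ce + cf + df + dg together with its quick and good factorings.

```python
class SOP:
    """A sum of products stored as a set of cubes (frozensets of literals)."""

    def __init__(self, cubes):
        self.cubes = frozenset(cubes)

    def __add__(self, other):
        return SOP(self.cubes | other.cubes)

    def __mul__(self, other):
        # Algebraic product: every cube of self joined with every cube of other.
        return SOP(x | y for x in self.cubes for y in other.cubes)

    def __eq__(self, other):
        return self.cubes == other.cubes


def lit(name):
    """A single positive literal as a one-cube expression."""
    return SOP([frozenset([name])])


a, b, c, d, e, f, g = map(lit, "abcdefg")

f1 = a*c + a*d + a*e + a*g + b*c + b*d + b*e + b*f + c*e + c*f + d*f + d*g
gf = (c + d + e)*(a + b) + f*(b + c + d) + g*(a + d) + c*e
qf = g*(a + d) + (a + b)*(c + d + e) + c*(e + f) + f*(b + d)
```

Both factored forms expand to the same twelve cubes, and each uses 14 literals, which is consistent with the remark that the two factorings are of similar quality.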

3.2.3 Simplification. We employ three forms of simplification, another 
powerful local transformation. This step minimizes the function of a single 
equation using the algorithms of ESPRESSO-II [4]. Depending on the stage of the 
optimization, minimization algorithms of differing cost/performance trade-offs 
are used. For example, quick simplification [4] (Chapter 3) is used at the beginning 
of the synthesis. Later, towards the end of the synthesis, the full power of 
the ESPRESSO-II minimization algorithm is used with a don’t-care set derived 
from the environment of the function [3]. For example, for the Boolean network 

$$u = b + y + z$$ 
$$v = z + \bar{a}$$ 
$$z = yad$$ 
$$y = bc$$ 

with outputs $u, v$ and intermediate nodes $y, z$, if we simplify function $z$, we can 
write down a don’t-care set which takes into account the environment of node $z$ 
embedded in the Boolean network: 

$$DC = y(\bar{b} + \bar{c}) + \bar{y}bc + (b + y)\bar{a}.$$ 

Then using $DC$ to simplify the function $yad$ we obtain $yd$. A general method for 
simplifying a node in a Boolean network using two-level minimization and an 
appropriate don’t-care set is given in [1]. 
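
The effect of the don't-care set can be checked by simulation. The sketch below assumes the network reads u = b + y + z, v = z + a', z = yad and y = bc (the complement on a in v and the reading of y as bc are assumptions), and takes the don't-care set as y(b' + c') + y'bc + (b + y)a'. It confirms that replacing z by yd leaves both outputs unchanged, and that every point where yad and yd differ lies inside the don't-care set.

```python
from itertools import product


def outputs(a, b, c, d, simplified):
    """Evaluate the network outputs (u, v) with z = yad or z = yd."""
    y = b and c
    z = (y and d) if simplified else (y and a and d)
    u = b or y or z
    v = z or (not a)
    return bool(u), bool(v)


def dont_care(a, b, c, d, y):
    """Assumed don't-care set for node z, with y treated as a free variable."""
    return bool((y and (not b or not c))
                or (not y and b and c)
                or ((b or y) and not a))


# The primary outputs are identical for every input assignment.
unchanged = all(
    outputs(*bits, simplified=False) == outputs(*bits, simplified=True)
    for bits in product([False, True], repeat=4)
)

# Wherever yad and yd disagree, the point is a don't care.
covered = all(
    dont_care(a, b, c, d, y)
    for a, b, c, d, y in product([False, True], repeat=5)
    if (y and a and d) != (y and d)
)
```

The two expressions disagree only when y and d are 1 and a is 0, and in that case the output v is already forced to 1 and u is forced to 1 by y, so node z cannot be observed.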

3.3 Timing Pre-Optimization 

To support timing optimization, we have a set of routines to break large gates 
into smaller ones, or to combine small gates into larger ones, in order to reduce 
the delay. 

Our first problem is to estimate the delay of the network. Our goal is a rea- 
sonably accurate, relative measure of the speed of the circuit in terms of the 
number of gates, number of fanouts, and the size of the gates along a critical 
path. We took the following approach to estimate the delay through each gate. 
First, for each input, we translate a gate of arbitrary series-parallel transistors 
into an equivalent n-input NAND gate. A NAND gate is characterized by the 
number of inputs, transistor widths, and load capacitance. The delay is modeled 
with a polynomial function of these parameters. The coefficients of the equation 
are determined by a least-squares fit of the curve to a large set of SPICE 
results for various NAND gates. Two sets of coefficients are used for CMOS circuits, 
one for the P-type transistors and one for the N-type. For each input to a 
gate, we compute the delay caused by the gate assuming that the given input is 
critical. This is done for both the pull-up phase and the pull-down phase. The 
critical paths are then traced through the appropriate input to output delays, al- 
ternating between gates from pull-up to pull-down and vice versa. By using this 
mechanism, we obtain better delay estimates and eliminate many false paths in 
the critical delay calculations. 

The optimization loop involves, given a set of timing constraints, computing 
delays through each gate, identifying all the critical paths, finding a minimum 
weighted node cut set of the critical paths, and the re-synthesis of these nodes. 
The weight of a node in the critical path is a measure of the number of literals on 
the critical path which are saved by having the node present. Finding a minimum 
weighted node cut set is equivalent to finding the maximum-flow/minimum-cut 
in a flow network. Because of the computational efficiency of the routines to 
compute delays and the cut-sets, we are able to iterate over these operations 
many times until the timing constraints are satisfied, or until it is apparent that 
they cannot be satisfied. 
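
The reduction from a minimum weighted node cut to maximum flow can be sketched as follows. This is a generic illustration using node splitting and a breadth-first augmenting-path search, not the routine used in MIS. The function name `min_node_cut`, the example graph, and its weights are assumptions, and the source and sink are assumed to carry no weight.

```python
from collections import defaultdict, deque


def min_node_cut(edges, weight, source, sink):
    """Minimum-weight set of nodes whose removal disconnects source from sink.

    Each weighted node v is split into (v, "in") -> (v, "out") with capacity
    weight[v]. Original edges get a capacity larger than any possible flow,
    so only the split arcs, and hence only nodes, can appear in the cut.
    """
    big = sum(weight.values()) + 1
    cap = defaultdict(lambda: defaultdict(int))

    def n_in(v):
        return (v, "in") if v in weight else v

    def n_out(v):
        return (v, "out") if v in weight else v

    for v, w in weight.items():
        cap[(v, "in")][(v, "out")] = w
    for u, v in edges:
        cap[n_out(u)][n_in(v)] = big

    def reachable_from_source():
        parent = {source: None}
        queue = deque([source])
        while queue:
            x = queue.popleft()
            for y, c in cap[x].items():
                if c > 0 and y not in parent:
                    parent[y] = x
                    queue.append(y)
        return parent

    while True:
        parent = reachable_from_source()
        if sink not in parent:
            break
        # Walk back from the sink to collect the augmenting path.
        path, y = [], sink
        while parent[y] is not None:
            path.append((parent[y], y))
            y = parent[y]
        push = min(cap[x][y] for x, y in path)
        for x, y in path:
            cap[x][y] -= push
            cap[y][x] += push   # residual arc

    # A node is cut when its "in" half is reachable but its "out" half is not.
    return {v for v in weight
            if (v, "in") in parent and (v, "out") not in parent}


# Hypothetical critical-path graph from inputs ("s") to outputs ("t").
edges = [("s", "n1"), ("s", "n2"), ("n1", "n3"), ("n2", "n3"),
         ("n3", "t"), ("n1", "n4"), ("n4", "t")]
weight = {"n1": 3, "n2": 2, "n3": 4, "n4": 2}
cut = min_node_cut(edges, weight, "s", "t")
```

In this example the cheapest way to intersect every path from s to t is to select n1 and n2, with total weight 5; these would be the nodes passed to re-synthesis.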

Because our timing estimates are necessarily inaccurate, we view the result 
of the optimization not as a precise delay calculation, but rather as a reasonable 
guide for re-structuring the architecture of the network to meet the timing con- 
straints. Also, the timing estimates are reasonable specifications for the module 
generators to use when synthesizing and placing the gates. More accurate tim- 
ing estimates and verifications may be employed later in the design cycle when 
the details of the gate designs and placements are known more precisely. 

4. Results 

Although work is continuing on MIS, Jacob Thomas of Advanced Micro De- 
vices became interested in testing the performance of MIS on some industrial 
circuits. The function of each of these circuits was taken from an actual chip de- 
sign, and had previously been designed by a human designer. In each case, the 
logic network was designed starting from a bdsyn description of the behavior 
of the network. 

The target technology was a fixed NAND/NOR library with up to 4 inputs per 
gate; no complex gates were used. MIS is currently oriented towards complex 
gate CMOS, so a manual translation was performed on the output of MIS to this 
technology. No optimization or merging of gates was done during the translation, 
so we expect these results to be an upper bound on the size of the final 
network. (It is expected that better results could be obtained by performing local 
optimization over the final network taking into account the constraints of the 
technology.) The results of this experiment are given in Figure 4. 

The results from the first two examples are very encouraging. The difference 
in the third example is explained, in part, by the fact that MIS is currently 
unable to use the don’t-care information inherent in the network. In a second 
experiment, listed as ex3b, the network was minimized as a two-level PLA with a 
don’t-care set, and then re-extracted into a multiple-level network. This network 
was not mapped into the technology for a direct comparison; however, it appears 
to be about two-thirds of the size. In a third experiment, listed as ex3c, we 
used MIS interactively, trying some of the more powerful operators, especially 
Boolean resubstitution, and obtained a further reduction to an estimated 10% 
increase over the manual design. We expect to improve on this result as MIS 
matures. 

sweep 
resub 

eliminate -1 
resub 

eliminate -1 
resub 

fastdecomp * 
resub 

eliminate 0 
resub 

eliminate 0 
simplify0 * 
resub 

kextract 5 
eliminate 0 
simplify1 * 
resub 

fastdecomp * 
resub 
cextract 
eliminate 0 
resub 
verify 20 



Figure 2. A Typical MIS Script. 

GLOBAL OPTIMIZATION
    collapse:     collapse the network to two level
    kextract:     extract kernels
    qextract:     extract level 0 kernels
    cextract:     extract single cube factors
    resub:        resubstitute node into all other nodes
    eliminate:    eliminate nodes below threshold
    sweep:        eliminate 0-fanout or 1-fanin nodes
    quickphase:   quick phase assignment
    goodphase:    good phase assignment

LOCAL OPTIMIZATION
    complement:   compute the complement
    decomp:       decompose large gates
    fastdecomp:   fast decomposition
    simplify0:    simplify using SIMPCOMP
    simplify1:    simplify using ESPRESSO only
    simplify2:    simplify using ESPRESSO with don’t care set
    strongd:      strong division of one node by another
    invert:       invert nodes
    qfactor:      factor nodes using quick-factoring
    gfactor:      factor nodes using good-factoring
    kernel:       generate kernels
    addinv:       add inverters as needed

TIMING OPTIMIZATION
    reduce:       reduce the delay through a critical node
    speedup:      speed up the network by percentage
    delay:        calculate delays
    setslack:     set the output slacks

MISCELLANEOUS
    verify:       verify current network against original
    lopen:        open a log file
    lclose:       close the current log file
    backup:       back up the current copy of the network
    restore:      restore the backup copy of the network
    undo:         undo last command that changed network
    source:       execute MIS commands in a file

Figure 3. Partial list of MIS commands. 

              Manual Design             MIS Design          Ratio
           Gate   Device  Logic     Gate   Device  Logic      of
Example    Count  Count   Levels    Count  Count   Levels   Devices
ex1          54    245      7         46    206      6       0.84
ex2          95    406      6         69    306      9       0.75
ex3         162    814      6        445   1965     12       2.50
ex3b        162    814      6          -   1250      -       1.53
ex3c        162    814      6          -    900      -       1.10

Figure 4. MIS results compared to manual design. 
Abstract 

We present an algorithm for determining the minimum representation of an incompletely-speci- 
fied, multiple-valued input, binary-valued output, function. The overall strategy is similar to the 
well-known Quine-McCluskey algorithm; however, the techniques used to solve each step are 
new. The advantages of the algorithm include a fast technique for detecting and eliminating from 
further consideration the essential prime implicants and the totally redundant prime implicants, 
and a fast technique for generating a reduced form of the prime implicant table. The minimum 
cover problem is solved with a branch and bound algorithm using a maximal independent set 
heuristic to control the selection of a branching variable and the bounding. Using this algorithm, 
we have derived minimum representations for several mathematical functions whose unsuccessful 
exact minimization has been previously reported in the literature. The exact algorithm has 
been used to determine the efficiency and solution quality provided by the heuristic minimizer 
Espresso-MV [11]. Also, a detailed comparison with McBoole [2] shows that the algorithm presented 
here is able to solve a larger percentage of the problems from a set of industrial examples 
within a fixed allocation of computer resources. 



1. Introduction 

Programmable Logic Arrays (PLAs) are an important design style in digital 
integrated circuit design [4]. The optimization of PLAs includes logic optimization 
steps (e.g., output phase assignment [4, 13], input variable assignment 
to decoders [4, 13] and classical minimization of the logic equations [7, 6, 1]) 
and topological optimization (e.g., PLA folding [5]). In this paper, we consider 
the problem of determining a minimum equivalent representation for multiple-valued 
input, binary-valued output functions. By stating the algorithms in 
terms of multiple-valued functions, we immediately consider the minimization 
of single- and multiple-output PLAs, the minimization of single- and multiple-output 
PLAs with input decoders of arbitrary size, and the minimization of arbitrary 
multiple-valued input, binary-valued output functions. 

The primary goal for minimization algorithms for PLA design is to minimize 
the total number of terms needed for the sum-of-products representation of the 
logic function. This reflects the goal of minimizing the area (and to a first-order 
approximation, the delay) of the PLA. As a secondary concern, we seek the 
representation requiring the fewest transistors in the PLA to reduce the internal 
capacitance and increase possibilities for folding. Exact minimization algorithms 
consist of the following steps: (1) generate all of the prime implicants 
of the switching function and form the prime implicant table, and (2) derive a 
minimum cover for the prime implicant table. 

Many different algorithms have been presented for solving these two prob- 
lems dating back to the early work of Quine [8]. The established 
Quine-McCluskey [7] procedure for Boolean minimization generates the prime 
implicants starting from a list of minterms. The prime implicant table has a row 
for each minterm and a column for each prime implicant. For each minterm 
row, a 1 is placed in a column if the corresponding prime implicant contains the 
minterm. The problem of selecting a minimum subset of primes has then been 
mapped into the problem of selecting a minimum cover of this matrix. (A cover 
is a row vector of 0’s and 1’s such that each row of the matrix shares a 1 in some 
column with the row vector, and a minimum cover is one with the fewest number 
of 1’s.) Before the covering problem is solved, standard techniques such as detecting 
essential columns, and deleting dominated rows and dominated columns 
are used to create a reduced prime implicant table. 
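
The table described above can be illustrated with a minimal Python sketch. The three-variable example function, the string encoding of cubes ('-' marks a variable that does not appear), and the helper names are assumptions made for this illustration; they are not taken from the Quine-McCluskey procedure as published.

```python
def contains(cube, minterm):
    """True if the cube (e.g. '0-1') contains the minterm (e.g. '011')."""
    return all(c == '-' or c == m for c, m in zip(cube, minterm))

def prime_implicant_table(minterms, primes):
    """One row per minterm, one column per prime implicant.
    An entry is 1 when the prime implicant contains the minterm."""
    return [[1 if contains(p, m) else 0 for p in primes] for m in minterms]

def is_cover(table, selected):
    """selected is a 0/1 row vector over the columns; it is a cover
    when every row of the table shares a 1 with it in some column."""
    return all(any(r and s for r, s in zip(row, selected)) for row in table)

# Hypothetical example: ON-set {000, 001, 011, 111}, whose primes are
# 00-, 0-1 and -11.
minterms = ['000', '001', '011', '111']
primes = ['00-', '0-1', '-11']
table = prime_implicant_table(minterms, primes)

# Rows 000 and 111 each contain a single 1, so columns 00- and -11 are
# essential, and together they already cover every row.
essential_only = [1, 0, 1]
```

In this small case the two essential columns form the minimum cover, so no covering search is needed; the harder cases are those in which no row has a single 1.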

The basic terms are defined in Section 2, and the minimization procedure is 
described in detail in Section 3. Results from an implementation of the proce- 
dure are given in Section 4. 

2. Definitions 

Definition 2.1: Let $p_i$ for $i = 1, \ldots, n$ be positive integers representing the 
number of values for each of $n$ variables. Define the set $P_i = \{0, \ldots, p_i - 1\}$ for 
$i = 1, \ldots, n$ which represents the $p_i$ values that variable $i$ may assume, and define 
$B = \{0, 1, *\}$ which represents the value of the function. A multiple-valued input, 
binary-valued output function, $f$, (hereafter known as a multiple-valued 
function) is a mapping 

$$f : P_1 \times P_2 \times \cdots \times P_n \rightarrow B$$

The function is said to have $n$ multiple-valued inputs, and variable $i$ is said to 
take on one of $p_i$ possible values. Each element in the domain of the function is 
called a minterm of the function. 

The value $* \in B$ will represent a minterm for which the function value is 
allowed to be either 0 or 1. Hence, we allow functions which are incompletely 
specified. 

Remark: An $n$-input, $m$-output switching function can be represented by a 
multiple-valued function of $n+1$ variables where $p_i = 2$ for $i = 1, \ldots, n$, and 
$p_{n+1} = m$. This special case is called a multiple-output function. It is easily 
proven that the minimization problem for multiple-output functions is equivalent 
to the minimization of a multiple-valued function of this form [12] (Theorem 
4.1). Likewise, the minimization problem for a PLA with input decoders [4] is 
also equivalent to the minimization of a multiple-valued function [12] (Theorem 
2.1). 

Definition 2.2: Let $X_i$ be a variable taking a value from the set $P_i$, and let $S_i$ 
be a subset of $P_i$. $X_i^{S_i}$ represents the Boolean function 

$$X_i^{S_i} = \begin{cases} 0 & \text{if } X_i \notin S_i \\ 1 & \text{if } X_i \in S_i \end{cases}$$

$X_i^{S_i}$ is called a literal of variable $X_i$. If $S_i = \emptyset$, then the value of the literal is 
always 0, and the literal is called empty. If $S_i = P_i$, then the value of the literal 
is always 1, and the literal is called full. 

A product term is a Boolean product (or AND) of literals. If a product term 
evaluates to 1 for a given minterm, the product term is said to contain the 
minterm. 

If all literals in a product term are full, the product term contains all minterms, 
and is called the universal product term. 

A sum-of-products (also called a cover) is a Boolean sum (or OR) of product 
terms. If any product term in the sum-of-products evaluates to 1 for a given 
minterm, then the sum-of-products is said to contain the minterm. 

The ON-set is the set of minterms for which the function value is 1. Likewise, 
the OFF-set is the set of minterms for which the function value is 0, and the 
DC-set is the set of minterms for which the function value is unspecified. 

An algebraic expression for $f$ is a Boolean expression (written using Boolean 
sums and Boolean products of literals) which evaluates to 1 for all minterms of 
the ON-set, evaluates to 0 for all minterms of the OFF-set, and evaluates to either 
0 or 1 for all minterms of the DC-set. 

Proposition 2.1: An algebraic expression for $f$ can always be written in 
sum-of-products form. 

An implicant of a function $f$ is a product term which does not contain any 
minterm in the OFF-set of the function. A prime implicant of a function $f$ is an 
implicant which is contained by no other implicant of the function. 

The Boolean Minimization Problem is to determine the minimum cost sum- 
of-products representation of a given multiple-valued function. For a PLA imple- 
mentation, the cost function is to minimize the total number of product terms in 
the representation. This is the cost function which is considered here, although 
extension to a more general cost function is also possible. 



2.1 Positional Cube Notation 

Let $X_1^{S_1} X_2^{S_2} \cdots X_n^{S_n}$ be a product term. This product term can be represented by 
a binary vector: 

$$c_1^0 c_1^1 \ldots c_1^{p_1-1} - c_2^0 c_2^1 \ldots c_2^{p_2-1} - \cdots - c_n^0 c_n^1 \ldots c_n^{p_n-1}$$

where $c_i^j = 0$ if $j \notin S_i$, and $c_i^j = 1$ if $j \in S_i$. This is called the positional cube 
notation or more simply a cube [17]. A cube is a convenient representation 
for a product term, and the terms cube and product term will often be used 
interchangeably. 

The notation $c_i$ represents the binary vector $c_i^0 c_i^1 \ldots c_i^{p_i-1}$, and $|c_i|$ represents 
the number of 1’s in the binary vector. 
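
As an illustration of positional cube notation, the following Python sketch encodes a product term as one bit list per variable and prints it in the dash-separated form used above. The function names, the list-of-lists representation, and the example values ($p_1 = 2$, $p_2 = 3$) are assumptions made for the example.

```python
def cube_from_literals(sizes, literal_sets):
    """Encode a product term in positional cube notation.
    sizes[i] is p_i and literal_sets[i] is S_i, a subset of range(p_i).
    Bit j of part i is 1 exactly when j is in S_i."""
    return [[1 if j in s else 0 for j in range(p)]
            for p, s in zip(sizes, literal_sets)]

def format_cube(cube):
    """Render the cube as parts separated by dashes, e.g. '01-101'."""
    return "-".join("".join(str(b) for b in part) for part in cube)

def contains_minterm(cube, minterm):
    """minterm gives one value per variable; it is contained in the cube
    when every variable's value is allowed by the corresponding literal."""
    return all(part[v] == 1 for part, v in zip(cube, minterm))

def intersect(a, b):
    """Intersection of two cubes is the bitwise AND of their vectors."""
    return [[x & y for x, y in zip(pa, pb)] for pa, pb in zip(a, b)]

def is_empty(cube):
    """A cube is empty as soon as one of its literals is empty."""
    return any(sum(part) == 0 for part in cube)

sizes = [2, 3]                                   # p_1 = 2, p_2 = 3
c = cube_from_literals(sizes, [{1}, {0, 2}])     # X_1^{1} X_2^{0,2}
d = cube_from_literals(sizes, [{0, 1}, {1}])     # X_1 full, X_2^{1}
text = format_cube(c)
```

Here `text` is `01-101`, and `c` and `d` do not intersect because their literals in the second variable share no value.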

3. Algorithms 

The algorithms presented here for solving the basic steps of the minimization 
problem are based on the heuristic algorithms which are used in Espresso-MV 
[10, 11]. Hence, we will refer to this algorithm as Espresso-EXACT. We as- 
sume we are given a cover of the ON-set of a multiple-valued function (called 
F), a cover of the DC-set (called D), and a cover of the OFF-set (called R). (If 
the complement is not provided, it may be efficiently computed using a fast 
multiple-valued complementation algorithm [15, 10]). The Espresso-EXACT 
Minimization Algorithm is: 

1 Generate from R all of the prime implicants (P) of the function. 

2 Detect the essential primes (E), the totally redundant primes (Rt), and the 
partially redundant primes (Rp) using a fast tautology algorithm. 

3 Create the prime implicant table (A) for the partially redundant primes 
using a modification of the tautology algorithm to record how a function 
avoids being a tautology. 

4 Generate a minimum cover for the prime implicant table. 

5 Select the primes which are in the cover for the minimum cover. 

6 Heuristically remove redundant literals from the cover using the 
MAKE-SPARSE heuristic of Espresso-MV [10]. 

3.1 Generation of Prime Implicants 

Many techniques for determining all of the prime implicants of binary-valued 
single-output and multiple-output logic functions have been published in the 
past [7, 18]. More recently, an algorithm based on recursive decomposition of a 
function followed by a pairwise consensus operation has been reported [3], and 
has been improved upon in the program McBoole [2] to reduce the total number 
of consensus operations which need to be performed. In this paper, we mention 
another paradigm for generating all of the prime implicants, which yields two 
slightly different algorithms. 

In order for a cube $c$ to be an implicant of F, $c$ must not intersect any cube 
$r^i \in R$. This can be expressed by writing a Boolean expression. Let $c_j^k$ be a 
Boolean variable representing the condition that part $k$ of variable $j$ of cube $c$ be 
set to 1. Let $(r^i)_j^k$ have the value of 1 if part $k$ of variable $j$ of the cube $r^i$ is a 1. 
The following Boolean expression asserts that $c$ does not intersect the OFF-set 
of the function: 

$$I = \bigcap_{i=1}^{|R|} \; \bigcup_{\substack{j=1 \\ |r_j^i| \neq p_j}}^{n} \; \bigcap_{k=0}^{p_j-1} \overline{(r^i)_j^k \cdot c_j^k}$$

We stress that the values of $r^i$ written as $(r^i)_j^k$ are known values of either 0 or 1, 
and that the variables in the above equation are $c_j^k$. 

To form a sum-of-products representation of $I$ requires that the 
product-of-sums-of-products expression be “multiplied-out”, that is, repeated intersection 
of sums-of-products covers. However, using DeMorgan’s law, we can directly 
write the expression for $\overline{I}$: 

$$\overline{I} = \bigcup_{i=1}^{|R|} \; \bigcap_{\substack{j=1 \\ |r_j^i| \neq p_j}}^{n} \; \bigcup_{k=0}^{p_j-1} (r^i)_j^k \cdot c_j^k$$

An implicant of the function $I$ corresponds to an assignment of 0, 1 to the 
variables $c_j^k$ which results in an implicant of $f$. Further, a prime implicant of $I$ 
corresponds to an assignment of 0, 1 to the variables $c_j^k$ which is maximal in 
the sense that no other variable which is 0 can be made a 1; therefore, a prime 
implicant of $I$ corresponds to a prime implicant of $f$. 

By construction, the logic function $I$ is a two-valued unate logic function. 
Hence, any prime cover for $I$ consists of all of the prime implicants for $I$ [1] 
(Prop. 3.3.7). 

This construction proposes two techniques for generating all of the prime 
implicants of a function: one which involves repeated intersection of sum- 
of-products forms and one which involves the complementation of a sum-of- 
products form. We note here that the first formulation is equivalent to the 
technique outlined by Roth [9] for generating all of the prime implicants of a 
multiple-output logic function. 
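
The condition used in this section (a cube is an implicant exactly when it intersects no cube of the OFF-set cover, and it is prime when it is maximal with that property) can be checked directly on small functions. The Python sketch below does so by exhaustive enumeration. This is only an illustration and is exponential in the number of parts; it is not the intersection or complementation technique described above. The representation of a cube as a tuple of frozensets (one set $S_j$ per variable) and all function names are assumptions.

```python
from itertools import combinations, product

def nonempty_subsets(p):
    """All nonempty literals S of a variable with p values."""
    return [frozenset(s) for r in range(1, p + 1)
            for s in combinations(range(p), r)]

def intersects(c, r):
    """Two cubes intersect iff their literals share a value in every variable."""
    return all(cj & rj for cj, rj in zip(c, r))

def contained_in(c, d):
    """c is contained in d iff each literal of c is a subset of d's literal."""
    return all(cj <= dj for cj, dj in zip(c, d))

def all_primes(sizes, off_set):
    """Enumerate every cube, keep those that miss every OFF-set cube
    (the implicants), then keep the maximal ones (the primes)."""
    cubes = product(*(nonempty_subsets(p) for p in sizes))
    implicants = [c for c in cubes
                  if not any(intersects(c, r) for r in off_set)]
    return [c for c in implicants
            if not any(c != d and contained_in(c, d) for d in implicants)]

# Hypothetical example: two binary variables, OFF-set = the minterm (0, 0),
# i.e. the two-input OR function.
sizes = [2, 2]
off_set = [(frozenset({0}), frozenset({0}))]
primes = all_primes(sizes, off_set)
```

For this example the two primes are the cubes with literals ({1}, {0, 1}) and ({0, 1}, {1}), which correspond to the two inputs of the OR function.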

3.2 Separation of E, Rt and Rp 

The primes P are split into the essential set E, the totally redundant set Rt, 
and the partially redundant set Rp according to the following rules: 

$$E = \{\, c \in P \mid c \not\subseteq (P \cup D - c) \,\}$$

$$R_t = \{\, c \in P,\ c \notin E \mid c \subseteq (E \cup D) \,\}$$

$$R_p = P - (E \cup R_t)$$

The cubes of E must belong to any cover of the function (it is the set of essential 
prime implicants), and no cube of Rt can ever belong to a minimum cover of F 
(it is the set of implicants dominated by the essential prime implicants). The 
cubes of Rp are partially redundant because, although any single cube of Rp can 
be removed, it is not possible to simultaneously remove all of the cubes of Rp 
while still maintaining a cover of F. Rp causes the most difficulty in trying to 
extract a minimum subset of P. 
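
The three rules can be sketched in Python for binary-valued functions. The sketch tests containment by expanding cubes into minterms, which is an assumption made for brevity in place of the tautology-based test discussed next; the string encoding of cubes and the function names are likewise illustrative.

```python
from itertools import product

def minterms_of(cube):
    """Expand a cube such as '0-1' into the set of minterms it contains."""
    choices = [('0', '1') if ch == '-' else (ch,) for ch in cube]
    return {''.join(bits) for bits in product(*choices)}

def covered_by(cube, cover):
    """True if every minterm of cube lies in some cube of cover."""
    covered = set()
    for d in cover:
        covered |= minterms_of(d)
    return minterms_of(cube) <= covered

def split_primes(primes, dc):
    """Split the primes into (E, Rt, Rp) following the three rules above."""
    essential = [c for c in primes
                 if not covered_by(c, [d for d in primes if d != c] + dc)]
    totally = [c for c in primes
               if c not in essential and covered_by(c, essential + dc)]
    partially = [c for c in primes
                 if c not in essential and c not in totally]
    return essential, totally, partially

# Hypothetical example 1: ON-set {000, 001, 011, 111}, empty DC-set.
split_a = split_primes(['00-', '0-1', '-11'], [])
# Hypothetical example 2: ON-set {000, 001, 011, 111, 110}, empty DC-set.
split_b = split_primes(['00-', '0-1', '-11', '11-'], [])
```

In the first example the prime 0-1 is totally redundant because the two essential primes cover it. In the second, 0-1 and -11 are partially redundant: either one may be dropped, but not both, since minterm 011 must remain covered.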

The separation of P into the covers E, Rt, and Rp is accomplished with a fast 
multiple-valued tautology algorithm [14, 10]. The basic test $c \subseteq H$ is done by 
forming the cofactor $H_c$, and then testing if $H_c$ is a tautology (i.e., if the function 
evaluates to 1 for all inputs). The tautology check uses a generalization of the 
Shannon Cofactor [10] as follows: 

Proposition 3.1: Let $c^i$, $i = 1, \ldots, m$ be a set of cubes satisfying $\bigcup_{i=1}^{m} c^i = 1$ 
and $c^i \cap c^j = \emptyset$ for $i \neq j$. Then, 

$$F = \bigcup_{i=1}^{m} c^i \cap F_{c^i}$$

It is easy to show that the tautology question for a function F can be answered 
by the following proposition: 

Proposition 3.2: $F \equiv 1 \iff F_{c^i} \equiv 1$ for $i = 1, \ldots, m$ 

The recursion is terminated when $F_{c^i}$ is trivial enough to allow a direct check 
for tautology. In particular, if the function $F_{c^i}$ is weakly-unate [10], the 
recursion can be terminated immediately. 
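
A minimal Python sketch of the recursive tautology check follows, restricted to the binary-valued special case in which the cubes $c^i$ of Proposition 3.1 are the two literals of a single variable. The string encoding of cubes, the function names, and the use of the ordinary unate test (a unate cover is a tautology only if it contains the universal cube) in place of the weakly-unate test are assumptions made for the illustration.

```python
def cofactor(cover, var, value):
    """Cofactor of a cover with respect to the literal x_var = value.
    Cubes that conflict with the literal are dropped; in the remaining
    cubes the variable is raised to don't-care."""
    result = []
    for cube in cover:
        if cube[var] in ('-', value):
            result.append(cube[:var] + '-' + cube[var + 1:])
    return result

def is_tautology(cover):
    """True if the cover evaluates to 1 for all inputs."""
    if any(set(cube) == {'-'} for cube in cover):
        return True                      # universal cube present
    if not cover:
        return False                     # empty cover is the constant 0
    n = len(cover[0])
    # Variables that appear in both phases; if there are none the cover
    # is unate and, lacking the universal cube, cannot be a tautology.
    binate = [v for v in range(n)
              if {'0', '1'} <= {cube[v] for cube in cover}]
    if not binate:
        return False
    v = binate[0]
    return (is_tautology(cofactor(cover, v, '0')) and
            is_tautology(cofactor(cover, v, '1')))

def cube_covered(cube, cover):
    """The test c contained in H: form the cofactor H_c, check tautology."""
    h = cover
    for v, ch in enumerate(cube):
        if ch != '-':
            h = cofactor(h, v, ch)
    return is_tautology(h)

covered = cube_covered('0-1', ['00-', '-11'])      # 001 and 011 both covered
not_covered = cube_covered('0-1', ['00-', '11-'])  # 011 is left uncovered
```

Each recursive call removes one binate variable, so the recursion terminates; the unate shortcut is what allows it to stop well before individual minterms are reached.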

3.3 Creating the Reduced Prime Implicant Table 

The technique for extracting a maximal subset of primes from Rp is now described. 
The key in the algorithm is a simple modification of the multiple-valued 
tautology algorithm. Rather than testing whether the function is a tautology, we 
determine which subsets of cubes in a function would have to be removed to 
prevent a function from becoming a tautology. 

For each cube $c \in R_p$, form $H = E \cup R_p - c$, and use a multiple-valued 
tautology algorithm to determine if $H_c$ is a tautology. By definition of E and Rp, 
$H_c$ must be a tautology (every cube of Rp is covered by the union of E and the 
remaining cubes of Rp). At each leaf in this tautology algorithm (i.e., where it is 
decided that the partial function is a tautology), the cubes which are in the cover 
at this leaf are examined. If there is a cube from E (or D) which is the universe 
in the restriction of the function corresponding to this leaf, then this leaf will 
always give rise to a tautology regardless of which cubes of Rp are discarded. 
Otherwise, all of the cubes of Rp which are the universe (in this leaf) must be 
removed in order to avoid having $H_c$ become a tautology in this leaf. In terms 
of determining how a cover covers the cube, this is equivalent to saying that H 
will fail to cover c if and only if all of the cubes of Rp which are the universe in 
this leaf are discarded. In this way, we enumerate all possible ways for subsets 
of Rp to avoid covering the original function. 

A binary matrix is formed where each cube of Rp is associated with a column. 
At each leaf in the tautology algorithm where no cube from E is the universal 
cube, a row is added to the binary matrix with a 1 for each column where $(R_p)^i$ is 
the universe. A minimal cover of this Boolean matrix corresponds to a minimal 
subset of the primes of Rp which must be retained in the cover for F. 

The algorithm proceeds by forming $H_c$ for each $c \in R_p$, and calling a modified 
version of the TAUTOLOGY procedure called find_tautology. Note that after 
determining how c can be covered, c can be moved to the set E thus improving 
the performance of the algorithm (because it is known how all of the minterms 
of c can be covered by selecting primes from Rp). 

The binary matrix formed in this way can be related to the prime implicant ta- 
ble of the Quine-McCluskey algorithm. This binary matrix is a reduced form of 
the prime implicant table; rather than each row of the matrix corresponding to a 
minterm of the function, each row corresponds to a collection of minterms (i.e., 
a larger subspace) all of which are covered by the same set of prime implicants. 
In the limit, the tautology algorithm will terminate at each of the minterms of the 
function, thus producing precisely the prime implicant table. However, in prac- 
tice, the algorithm is terminated much more quickly, leading to a more efficient 
creation of the prime implicant table. 

3.4 Covering Algorithm 

The standard branch-and-bound solution to the minimum cover problem in- 
volves the following steps: 

1 Remove rows which are dominated by other rows (i.e., the row contains 
some other rows), and remove columns which are dominated by other 
columns (i.e., the column is contained in some other column). Detect 
essential columns (a row with a single 1 identifies an essential column) 
and add these to the selected set. Repeat until no new essential elements 
are detected. 

2 If the size of the selected set exceeds the best solution so far, return from 
this level of the recursion. Or, if there are no elements left to be covered, 
declare the selected set as the best solution recorded so far. 

3 Select (heuristically) a branching column. 

4 Add this column to the selected set and recur for the submatrix resulting 
from deleting this column and all rows which are covered by this column. 
Then, recur for the submatrix resulting from deleting this column without 
adding it to the selected set. 
We propose the following two heuristics to speed up this standard covering 
algorithm: 

1 At step 2, determine a lower bound on the size of a cover for the current 
submatrix. This can be done by determining a maximal set of rows all 
of which are pairwise disjoint (called a maximal independent set of rows) 
using a straightforward, greedy algorithm. (Note that the problem of find- 
ing a maximum independent set of rows is itself NP-complete; however, 
we need only a large independent set of rows.) Because each row must 
be covered, and all of the rows in the maximal independent set share no 
column in common, the size of the maximal independent set is a lower 
bound on the number of columns needed to complete the cover. At step 2, 
the recursion can be bounded if the size of the selected set at step 2 plus 
the size of the maximal independent set of rows equals or exceeds the best 
solution known. Hence, quite often, unprofitable searches are terminated 
very high in the recursion. 

2 To select the branching column, we use the following heuristics. First, 
form the union over all of the rows in the maximal independent set of 
rows. At least one of these columns must be in the selected set; hence, 
we limit the selection of a branching column to one of these columns. 
Second, assign each nonzero element of each row a weight equal to the 
reciprocal of the number of nonzero elements in the row. Sum the weight 
for each column, and choose as a branching column the column of maxi- 
mum weight which is also in the union over the maximal independent set 
of rows. This weighting strategy gives the elements of the smaller sets a 
higher weight, as they are viewed as “harder” to cover. Another reason 
for favoring an element from a smaller set (for example, a set with two el- 
ements) is to create more essential elements in the subsequent recursion. 
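
The two heuristics can be sketched in Python as follows. The sketch is a simplified illustration: it omits the row and column dominance and essential-column reductions of step 1 of the standard algorithm, and the representation of the matrix as a list of frozensets of column indices, the function names, and the tie-breaking rule are assumptions.

```python
def independent_rows(rows):
    """Greedy maximal set of pairwise disjoint rows (shortest rows first)."""
    chosen = []
    for row in sorted(rows, key=len):
        if all(row.isdisjoint(other) for other in chosen):
            chosen.append(row)
    return chosen

def min_cover(rows, selected=(), best=None):
    """Return a minimum list of columns that hits every row.
    rows: list of frozensets holding the column indices of the 1 entries.
    selected: columns chosen so far; best: best complete cover so far."""
    if not rows:
        if best is None or len(selected) < len(best):
            return list(selected)
        return best
    if any(len(r) == 0 for r in rows):
        return best                       # some row can no longer be covered
    indep = independent_rows(rows)
    # Lower bound: each independent row needs its own column.
    if best is not None and len(selected) + len(indep) >= len(best):
        return best
    # Weight each column by the sum of 1/|row| over the rows it covers.
    weight = {}
    for r in rows:
        for col in r:
            weight[col] = weight.get(col, 0.0) + 1.0 / len(r)
    # Branch only on columns appearing in the independent rows.
    candidates = set().union(*indep)
    col = max(sorted(candidates), key=lambda c: weight[c])
    # Branch 1: select the column and delete the rows it covers.
    best = min_cover([r for r in rows if col not in r],
                     selected + (col,), best)
    # Branch 2: delete the column without selecting it.
    best = min_cover([r - {col} for r in rows], selected, best)
    return best

# Hypothetical cyclic example: six rows, each covered by two adjacent columns.
cyclic_rows = [frozenset({i, (i + 1) % 6}) for i in range(6)]
cover = min_cover(cyclic_rows)
```

On the cyclic example no column is essential, yet the independent-set bound of 3 lets the search stop as soon as the first cover of size 3 has been found.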

4. Results 

A program implementing this minimization algorithm, Espresso-EXACT, has 
been compared against another exact minimization program McBoole [2], and 
against the heuristic minimization program Espresso-MV [10] for a large set 
of PLAs drawn from industrial designs (111 PLAs), and a smaller set of standard 
mathematical functions (23 PLAs). 

With a test set so large, it is a challenge to present the results in a meaningful 
manner. Measuring either the number of inputs and outputs, or the number of 
product terms is a notoriously inaccurate measure of the complexity of Boolean 
minimization for a specific example. Therefore, we first classify each of the 134 
minimization problems as shown in Figure 1. 

The classifications were determined by allowing Espresso-EXACT and Mc- 
Boole to run a maximum of 5 hours for each example on an Apollo DN660. 






Classification         Description
trivial                minimum solution consists of essential prime implicants
noncyclic              the covering problem contains no cyclic constraints
cyclic and solved      the covering problem contains cyclic constraints and the
                       minimum solution is known
cyclic and unsolved    the covering problem contains cyclic constraints but the
                       minimum solution is unknown
too many primes        there were too many primes to be enumerated

Figure 1. Problem Classification. 



By examining the results for each program, we arrive at the classification. If 
the problem was solved by either of the two exact minimization algorithms, it 
is easy to classify it as trivial, noncyclic, or cyclic and solved. An example is 
classified as too many primes only if neither program was able to enumerate 
the set of prime implicants, and an example is classified as cyclic and unsolved 
only if neither program was able to complete the covering problem (after having 
generated the set of all prime implicants). 

4.1 Comparison of Exact Minimization Algorithms 

Figure 2 summarizes the comparison between Espresso-EXACT and McBoole 
for the 134 PLAs in the test set. Number primes refers to the number of examples 
for which each program could enumerate all of the prime implicants, 
number solved refers to the number of examples each program could solve, and time 
gives the time on an Apollo DN660 taken to solve those examples which could 
be solved within the 5 hour limit. 



                       Espresso-EXACT                McBoole
type        total   number  number    time     number  number    time
                    primes  solved   (sec)     primes  solved   (sec)
trivial         9        9       9     120          9       9     271
noncyclic      56       55      54  26,524         56      56  35,956
cyclic-s       42       42      41  41,330         42      21  11,241
cyclic-u       10        7       0                 10       0
primes         17        0       0                  0       0
Totals        134      113     104  67,974        117      86  47,468

Figure 2. Comparison between Espresso-EXACT and McBoole. 

For examples with no cyclic constraints (i.e., trivial or noncyclic), both 
Espresso-EXACT and McBoole are usually able to find the minimum solution. 
However, when there are cyclic constraints, the covering algorithm of Espresso-EXACT 
is able to find the minimum solution for many more of the PLAs than 
McBoole. Sometimes the results are quite dramatic. For example, McBoole 
was allowed to run 58 hours on one example without terminating; however, 
Espresso-EXACT was able to complete this example in only 100 seconds. 

The number of prime implicants ranged from 6 to 9179, and 36 of the exam- 
ples had more than 1,000 prime implicants. In each of the cases where McBoole 
was able to finish enumerating all primes and Espresso-EXACT was unable to, 
there were more than 5,000 prime implicants. This indicates that the prime gen- 
eration strategy of McBoole is superior when the number of prime implicants 
becomes very large. The execution time for the prime generation algorithms 
varied greatly, although the average time was comparable. 

There were 83 examples which both programs could minimize, 3 examples 
which only McBoole could minimize, 21 examples which only Espresso-EXACT 
could minimize, and 27 examples which neither program was able to minimize. 
For the 83 examples which both programs could minimize, Espresso-EXACT 
used 38,198 seconds, and McBoole used 28,628 seconds. As expected, both 
returned the same number of solution terms. The Espresso-EXACT result had 
51,821 literals and the McBoole result had 53,686 literals, indicating that the 
heuristic algorithm of Espresso-EXACT was more effective at reducing the num- 
ber of literals than the heuristic of McBoole. 

4.2 Espresso-MV Results 

Next, we examine the effectiveness of the heuristic minimization algorithm 
Espresso-MV. To compare the results of Espresso-MV and Espresso-EXACT, we 
consider only those examples for which Espresso-EXACT was able to generate 
the minimum solution. The results are reported in Figure 3. The column number 
minimum reports the number of examples for which Espresso-MV achieved the 
minimum solution. 



                      Espresso-MV                  Espresso-EXACT
type        number   number   solution    time    solution    time
                     minimum  terms      (sec)    terms      (sec)
trivial          9        9      243        23       243       120
noncyclic       54       48    3,371     1,366     3,360    26,523
cyclic-s        41       19    3,463     2,532     3,395    41,329
Totals         104       76    7,077     3,920     6,998    67,973

Figure 3. Comparison between Espresso-MV and Espresso-EXACT. 

We see that Espresso-MV provides a high quality result for all of the examples 
for which we can generate a minimum solution. In many cases, Espresso-MV 
is providing the minimum solution, and the total difference in terms between 
Espresso-MV and Espresso-EXACT is only one percent. The largest difference 
for any single example was 7.5%. Not surprisingly, Espresso-MV has the most 
difficulty with the more difficult problems in the test set (i.e., those problems 
with cyclic constraints). 

The comparison shows that Espresso-MV frequently produces results very 
near the minimum solution, and, as expected, the heuristic minimizer is much 
faster than the exact algorithm. However, Espresso-MV cannot guarantee that 
it has achieved the minimum solution. The large increase in time for the exact 
minimizer is primarily due to the effort needed to prove that the produced result 
is in fact minimum. 
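
The summary figures quoted in this section can be rederived from the Totals row of Figure 3. The short Python sketch below does the arithmetic; the variable names are illustrative, and the numbers are copied from the figure.

```python
# Totals row of Figure 3 (104 examples solved by Espresso-EXACT).
examples = 104
mv_minimum = 76          # examples where Espresso-MV reached the minimum
mv_terms, exact_terms = 7077, 6998
mv_time, exact_time = 3920, 67973   # seconds

# Excess product terms of the heuristic result over the minimum, in percent.
excess_terms_pct = 100.0 * (mv_terms - exact_terms) / exact_terms

# Fraction of examples for which the heuristic result is already minimum.
minimum_pct = 100.0 * mv_minimum / examples

# How many times longer the exact minimizer ran in total.
slowdown = exact_time / mv_time
```

The excess comes to roughly 1.1 percent of the terms, the heuristic reaches the minimum on about 73 percent of the examples, and the exact minimizer takes about 17 times as long in total.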

4.3 Exact Results for Mathematical Functions 

Mathematical functions have long been used as standard benchmarks for min- 
imization algorithms. We report here some new results on the minimum size of 
several mathematical functions. These examples have been examined by other 
exact minimizers without providing the minimum solution [2, 16]. 

The problems involve simple mathematical functions (multipliers mlp3 and 
mlp4, norm function dist4, and a squaring circuit sqr6). They have been solved 
as a multiple-output minimization problem (normal in the table), and using two-input 
decoders (bit-paired in the table). These problems are deceptively small 
and have a small number of prime implicants. The difficulty lies in extracting 
the minimum cover for the prime implicant table. Figure 4 presents the results 
for these examples. Times are reported for a DEC MicroVAX II running Ultrix 
and are in seconds. Examples marked with ’+’ represent a new result. 



                         Normal                      Bit-paired
Name     In/Out   Primes  Minimum    Time     Primes  Minimum     Time
mlp3     6/6          90     + 30      12         71       22        8
mlp4     8/8         606    + 121   4,700        561     + 83  119,000
dist4    8/5         401      120     100        493     + 94   11,500
sqr6     6/12        205     + 47     114        241       39       93

Figure 4. Results on previously unsolved mathematical function problems. 

We note that the best previous results for mlp4 were 123 and 88 for the normal 
and bit-paired cases, respectively. For the other examples marked with ’+’, we 
have verified that the previously known solutions are minimum. 
IMPROVED LOGIC OPTIMIZATION 
USING GLOBAL-FLOW ANALYSIS 



C. Leonard Berman and Louise H. Trevillyan 
Abstract 

This paper is concerned with techniques for automatically reducing circuit size and improving 
testability. In an earlier paper [2], we introduced a new method for circuit optimization based on 
ideas of global-flow analysis. In this paper, we describe two extensions to the method. The first is 
a basic improvement in the primary result on which the earlier optimization was based; the second 
extends the applicability of the method to “conditional” optimizations as well. Together, these 
enhancements result in improved performance for the original algorithm, as well as the ability to 
handle designer-specified “don’t cares” and redundancy removal uniformly in the framework of 
a graph-based synthesis system, such as LSS [8]. 



1. Overview 

There are two basic approaches to multi-level design: 1) the algebraic/boolean 
approach [6], and 2) the graph-based approach [4]. Although systems generally 
incorporate ideas from both approaches, it is fair to say that the boolean/algebraic 
method typically represents a function as a directed acyclic graph (dag) whose 
nodes compute arbitrary functions and performs optimizations using factoring 
and two-level minimization on the nodes. On the other hand, the graph-based 
approach represents a function as a dag whose nodes compute simple functions 
and performs optimizations using graph manipulation and data flow algorithms. 
This paper describes a significant enhancement to an optimization which be- 
longs to the conceptual and algorithmic framework of the graph-based synthesis 
approach. We note that there has been progress toward applying these ideas in 
the framework of algebraic synthesis [7]. 

In [2] the authors introduced a new circuit optimization which used ideas 
from global-flow analysis to optimize a circuit by first computing circuit summary
information and then using the information via the min-cut algorithm to reduce
the number of connections in a circuit. This paper describes improvements
to this method. There are two main contributions. The first is an improvement 
of the main theorem of [2] which guaranteed that the optimization was legal, i.e. 
that it left the function of the circuit unchanged. The second contribution is a 
refinement of the method that allows “conditional optimizations” to be handled 
uniformly. These conditional optimizations include connection-reduction opti- 
mizations which use designer-specified “don’t care” information [5], as well as 
redundancy-removal optimizations. Together, the extensions reported here result
in a significant conceptual, as well as practical, simplification of the logic
optimization phase of LSS. This phase, which until recently included more than
a dozen local transformations, can now be described entirely in terms of two
primitive processes: global-flow methods for boolean logic optimization, and
factoring to satisfy fan-in and fan-out constraints.

Figure 1. An example of Global-Flow Optimization.

The global-flow method is described in detail in [2, 3]. At a high level, it can 
be described as follows. Iteratively for each net, the procedure determines how 
the terminals of the current net can be rearranged and chooses a legal rearrangement
for implementation. These legal rearrangements are determined assuming
the net carries the “controlling value” (e.g. the controlling value of a NOR is 1 
and of a NAND is 0) and reflect the boolean nature of the operators (as opposed 
to the algebraic nature of factoring). These possible rearrangements are deter- 
mined using summary information which consists of assertions concerning the 
state of other nets in the circuit. The correctness of the rearrangements is guar- 
anteed by Theorem 1 of [2] which states that rearrangements of the terminals 
of a net that leave the functions at its “frontier” unchanged are legal. (These
concepts will be defined precisely later.)

Example: We use the circuit in Figure 1 to illustrate throughout. In Figure 1(a),
when the procedure considers signal “i”, it performs deduction under the assumption
that i = 1, since 1 is the controlling value for NOR gates. It derives
the assertions: n = 0, m = 0, j = 0 (because j = 1 implies i = 0, so i = 1 implies
j = 0), p = 1, r = 0, s = 0 and q = 0. It then determines that FRONTIER(i) =
{q = 0, r = 0}. The algorithm then changes the terminals of “i” to produce the
circuit shown in Figure 1(c).

Since the optimizations involve rearranging the terminals of the net under 
consideration, it is important to differentiate between assertions which are inde- 
pendent of the precise connections of the current net and those which depend 
on where this net goes. Our first result addresses this distinction. In our ear- 
lier paper, this distinction was captured to some extent in the definition of the 
frontier. In this paper, this distinction is refined through the idea of an assertion 
concerning the circuit being “free”. The earlier theorem is extended to show that 
any rearrangement which leaves the non-free part of the frontier unchanged is 
legal. This results in significant connection reduction in some examples. Note 
that in the above example, the connection of “i” at the source of s is not neces- 
sary because the assertion q = 0 is independent of where “i” is connected. It is 
this type of information which is captured in the notion of a “FREE” assertion. 

The second enhancement to the method results from an improvement in the 
formalism which permits us to reason about individual terminals rather than en- 
tire nets. This does not require any fundamental change to the techniques, but 
it does permit “conditional” optimizations to be accommodated in the same
theoretical framework. These conditional optimizations include connection-reduction
optimizations that utilize designer-specified output “don’t care” information
(this has not been possible before in graph-based synthesis systems) and
redundancy-removal optimizations. Connection-reduction optimizations are
those which leave the function unchanged at the most forward unchanged nodes, while
redundancy-removal optimizations may change the function at the most forward
nodes, but the change cannot be seen at any output because of the circuit structure.



2. Summary of the Technical Material 
2.1 Terminology and Foundations 

Circuits are represented as dags. For simplicity, we assume the nodes of the 
dag to be NORs, INPUTS or OUTPUTS. We use the terms gate, node, signal and
terminal in the standard way. Since we assume each gate has a single output, we
will also identify the net with the node which is its source. In what follows C is
an arbitrary circuit satisfying the above constraints.

Definition:

$A = \{(c_i, v_i) \mid c_i \text{ are terminals of } C \text{ and } v_i \text{ are associated values}\}$
is called a partial assignment of C.

We can think of a partial assignment as a set of assertions (or deductions) 
about the state of the circuit. For example, if $(c,1) \in A$, we may say that A contains
the assertion c=1, or equivalently that A asserts that terminal c has value
1. We will write partial assignments as sets of pairs or sets of assertions inter- 
changeably. Note that a partial assignment, A, determines a set of possible input 
values, i.e. all input choices which result in the assertions of A being true. We 
shall refer to such inputs as input assignments of A. 

Definition: For any partial assignment A of C, let A* denote the partial assign- 
ment consisting of all assertions implied by A. This is called the closure of A. 

Although the results described here are independent of the method used to 
derive these assertions, for the sake of clarity in this paper, we use the following 
recurrences, which we introduced in [2], as our deductive system. Let X(s) = 
{inputs of the source of s}. 

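$$
\begin{aligned}
C_{10}(A) ={}& \{(s,0) \mid \exists y \in X(s)\,[(y,1) \in C_{11}(A)]\} \cup A \\
&\cup \{(s,0) \mid \exists y\,[(y,1) \in C_{11}(A),\ s \in X(y)]\} \\
&\cup \{(s,0) \mid \exists x\,[(x,1) \in A,\ (x,0) \in C_{10}(\{(s,1)\})]\} \\
&\cup \{(s,0) \mid \exists c \text{ on the same net as } s,\ (c,0) \in C_{10}(A)\} \\[4pt]
C_{11}(A) ={}& \{(s,1) \mid \exists y\,[(y,0) \in C_{10}(A),\ s \in X(y),\ \forall t \in X(y)\,[t \neq s \Rightarrow (t,0) \in C_{10}(A)]]\} \\
&\cup \{(s,1) \mid \forall y \in X(s)\,[(y,0) \in C_{10}(A)]\} \cup A \\
&\cup \{(s,1) \mid \exists x\,[(x,1) \in A,\ (x,0) \in C_{00}(\{(s,0)\})]\} \\
&\cup \{(s,1) \mid \exists c \text{ on the same net as } s,\ (c,1) \in C_{11}(A)\}
\end{aligned}
$$

Similarly for $C_{00}$ and $C_{01}$.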

The use of this deductive method results in the closure, A*, being the least
fixed point of these recurrences. We note that Hachtel et al. [9] use related
deductive methods while Brayton et al. [7] take a different approach.
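
The fixed-point character of the deduction can be illustrated with a small sketch. It is not the authors' implementation: it reasons about whole signals rather than individual terminals, it implements only the forward and backward NOR rules (the contrapositive term is omitted, so it may compute a subset of A*), and the circuit, the function name `closure` and the `fanin` dictionary are hypothetical.

```python
def closure(fanin, assignment):
    """Propagate NOR implications until nothing changes.

    fanin maps each NOR gate to the list of signals feeding it;
    signals that never appear as keys are primary inputs.
    assignment maps signal -> 0 or 1 (the initial assertions).
    Raises ValueError if two deductions contradict each other.
    """
    values = dict(assignment)

    def assert_value(sig, val):
        # Record a deduction; return True only if it is new.
        if sig in values:
            if values[sig] != val:
                raise ValueError(f"conflicting values for {sig}")
            return False
        values[sig] = val
        return True

    changed = True
    while changed:
        changed = False
        for gate, ins in fanin.items():
            known = [values.get(x) for x in ins]
            # Forward: any input at 1 forces the NOR output to 0.
            if 1 in known:
                changed |= assert_value(gate, 0)
            # Forward: all inputs at 0 force the NOR output to 1.
            if all(v == 0 for v in known):
                changed |= assert_value(gate, 1)
            out = values.get(gate)
            # Backward: output at 1 forces every input to 0.
            if out == 1:
                for x in ins:
                    changed |= assert_value(x, 0)
            # Backward: output at 0 with all other inputs at 0
            # forces the one remaining input to 1.
            if out == 0:
                unknown = [x for x, v in zip(ins, known) if v is None]
                if len(unknown) == 1 and known.count(0) == len(ins) - 1:
                    changed |= assert_value(unknown[0], 1)
    return values


# Hypothetical circuit (not the one in Figure 1).
fanin = {"i": ["a", "b"], "n": ["i", "c"], "p": ["n", "m"], "m": ["i", "d"]}
result = closure(fanin, {"i": 1})
```

Starting from the single assertion i = 1, the sketch deduces values for a, b, n, m and p and leaves c and d undetermined.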

The definition of partial assignments permits different values to be associated 
with different terminals of a net. Such a partial assignment corresponds to in- 
consistent assertions about the state of the circuit. Inconsistent assignments are 

Definition: A partial assignment A is said to be compatible in C if whenever
$(c_i, v_i)$ and $(c_j, v_j) \in A^*$, and $c_j$ and $c_i$ refer to terminals of the same net, then
$v_j = v_i$.

The following notation will be useful. For any compatible A, we let C\A be a 
circuit identical to C but with all terminals of A removed, and each terminal of 
A which is the input to a gate replaced by a connection to a new primary input. 
Also, let $B(A) = \{(c, v) \mid (c, v) \in A^*$ and there is no path in C from any terminal
in A to any terminal on the same net as $c\}$.

Observe that B(A) contains assertions which are dependent on how the signals
in A are computed but independent of how they are used. This is true because
any assignment to inputs that results in the terminal-value pairs of A must
also result in the pairs contained in B(A). In fact, for any input assignment of A, 
if C is changed by rearranging terminals in A (subject to some restrictions), the 
assertions of B(A) still hold. This is because no path exists from A to B(A), so 
there can be no direct dependence of these assertions on the terminals in A. We 
illustrate this with the following example. 

Example: Consider the circuit from Figure 1(a), and assume that the partial
assignment, A, under consideration sets all terminals of signal “i” to 1. We see
that $A^* = \{(n,0), (m,0), (j,0), (p,1), (r,0), (s,0), (q,0)\}$, with each signal-value
pair replaced by all the appropriate terminal-value pairs. We also see that B(A)
contains the assertions n=0, m=0, j=0 and p=1, again with signals replaced by
the appropriate terminals.
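
The path condition in the definition of B(A) amounts to a reachability test. The following is a minimal sketch under simplifying assumptions: it works on signals instead of terminals, it treats a signal as reachable from itself, and the circuit, the `fanout` dictionary and the function name are hypothetical rather than taken from the paper.

```python
def independent_assertions(fanout, assigned, closure):
    """Keep the assertions whose signal is not reachable from any assigned signal.

    fanout maps a signal to the gates it feeds; assigned is the set of
    signals constrained by A; closure maps signal -> deduced value.
    """
    reachable = set()
    stack = list(assigned)
    while stack:
        sig = stack.pop()
        if sig in reachable:
            continue
        reachable.add(sig)
        stack.extend(fanout.get(sig, ()))
    return {sig: val for sig, val in closure.items() if sig not in reachable}


# Hypothetical circuit: a and b feed i; i feeds n and m; n and m feed p.
fanout = {"a": ["i"], "b": ["i"], "i": ["n", "m"], "c": ["n"],
          "d": ["m"], "n": ["p"], "m": ["p"]}
deduced = {"i": 1, "a": 0, "b": 0, "n": 0, "m": 0, "p": 1}
b_of_a = independent_assertions(fanout, {"i"}, deduced)
```

Only the assertions about a and b survive here, because every other deduced signal lies downstream of i.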

The independence is important and leads us naturally to our main definition. 
Definition: $FREE(A) = B(A)^*_{C \setminus A}$, the closure of B(A) computed in the circuit C\A.

Example: If we continue the earlier example, we see that FREE(A) contains
q=0 as well as the assertions n=0, m=0, j=0 and p=1 which were also in B(A).

We see intuitively that the assertions in FREE(A) are consistent with A* and
also independent of A. Consistency is established by:

Lemma: If A is compatible then

$\{c \mid \{(c,0),(c,1)\} \subseteq FREE(A) \cup A^*\} = \emptyset$.

Proof: From the definition of the closure operator, we see that for any set of
assertions, X, and partial assignment, A, $X^*_{C \setminus A} \subseteq X^*_C$ and that $B(A)^* \subseteq A^*$.
Combining these observations shows that $FREE(A) \subseteq A^*$, and since A is compatible,
the result follows.

Intuitively, the independence of FREE(A) follows from the construction of 
C\A. By replacing connections of terminals of A which are inputs of gates by 
connections of new primary inputs, we prevent the establishment of any asser- 
tions which depend on the precise terminals of A. 

Another idea which we need is that of the frontier of a signal in a set of 
nodes. Intuitively, the frontier is the subset of nodes closest to the outputs. More 
formally, given a signal “i” in C and a partial assignment, A, we define the
frontier of i in A, $\mathcal{F}(A,i)$, as the set of nets, j, for which

1 $(j,0) \in A^*$

2 There is a path $j \to j_1 \to j_2 \ldots$ OUTPUT such that for no $j_k$ does
$j_k \in A^*$

3 j is reachable in the circuit from i.

When we say “frontier of a net”, we are referring to the partial assignment 
which assigns one to all terminals of that net. 

The importance of the sets $\mathcal{F}(A,i)$ and FREE(A) is illustrated by the following
theorem:

Main Theorem: Let i be a net in C and let A be any partial assignment which 
sets i to 1. Let C’ be identical to C except that terminals of i in A may be 
missing and additional terminals of i may be present. Assume that none of the 
connections of i in C' are to nodes in FREE(A). If in the two circuits the sets 
$\{x \mid x \in \mathcal{F}(A,i) \wedge x \notin FREE(A)\}$ are identical, then the two circuits compute
the same function. 

This theorem is very similar to the main theorem of [2]. In fact, it is estab- 
lished by showing that, in this case, the hypotheses of our earlier theorem hold. 
In our earlier work, we could guarantee that a rearrangement was legal only if 
the frontier sets in the two circuits were identical; while to apply this theorem, 
we require only that the non-“FREE” part of the frontier sets be identical. Since
connections must be added to maintain the equivalence of these two sets, we see 
that our new result can, in principle, result in smaller circuits. We have found 
this to be true in practice. 

Example: Continuing the example, we saw earlier that $\mathcal{F}(A,i) = \{q = 0, r = 0\}$.
If we combine this with the value for FREE(A) computed above, we see that
$\{x \mid x \in \mathcal{F}(A,i) \wedge x \notin FREE(A)\} = \{r = 0\}$. The above theorem then guarantees
that the circuit shown in Figure 1(b) is equivalent to that shown in Figure
1(a). Note that in Figure 1(b), signal i has only two terminals. As mentioned
earlier, our previous method which did not make use of “FREE” connections
would result in the circuit shown in Figure 1(c), in which net i has three termi- 
nals. We realize that this is a simple example which could be handled by other 
less sophisticated methods. It is meant only to illustrate the improvement in the 
new optimization. 
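
The set that the Main Theorem asks us to preserve is a plain set difference. A minimal sketch using the values quoted in the running example (the variable names are illustrative only):

```python
# Assertions are (signal, value) pairs, as in the partial-assignment notation.
frontier = {("q", 0), ("r", 0)}                      # frontier of i in A
free = {("n", 0), ("m", 0), ("j", 0), ("p", 1), ("q", 0)}  # FREE(A)

# Only the non-FREE part of the frontier must be identical in both circuits.
must_preserve = frontier - free
```

Because q = 0 is a FREE assertion, only r = 0 constrains where the terminals of i may be reconnected.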

As presented here, the method appears to be computationally intensive since, 
to compute the “FREE” connections, we must perform deductions for a new 
circuit. However, we note that a good approximation to the deductions embodied 
in the recurrences can be computed on a signal-by-signal basis. This is done by 
a straightforward use of dataflow propagation techniques [1, 10] which never 
compute information that will be invalidated before it is needed. The running 
time of this procedure is proportional to the product of the size of the controlling 
sets and the average fan-in of the circuit. These techniques enable us to compute 
the “FREE” connections with minimal extra cost during a single graph traversal. 

2.2 Experimental Results 

We performed a number of experiments to evaluate the impact of our main 
theorem on the effectiveness of global-flow optimization. We did this by creat- 
ing two programs, one of which used our main theorem and one of which relied 
on Theorem 1 of [2]. Both programs utilized the theorems through the artifact of 
derived graphs. The method used to construct the derived graphs was somewhat 
different from that described in [2]; however, identical methods were used in 
both programs. In addition, because of the computational constraints alluded to 
above, we did not compute the entire fixed point of the recurrences; rather, we 
weakened the recurrences by dropping the term corresponding to the contrapos- 
itive throughout the experiment. When computing the results based on [2], we 
weakened the recurrences even more by including only the term corresponding 
to forward propagation. The result of these two approximations can only exag- 
gerate the possible benefit due to our main theorem, and therefore, our results 
should be considered an upper bound on the usefulness of this idea. We feel 
strongly that other experiments are needed to evaluate the benefit which might 
be gained by utilizing the contrapositive or even using the entire Fij sets defined 
in [2]. 

Our experiments began with logic which had been run through the high-level 
and AND/OR-level optimizations of LSS and then translated to NORs. (See [4, 
8] for details about these optimizations.) We then ran one of the two programs: 
FREEOPT, which utilized our main theorem, or WEAKOPT, which was based 
on our earlier methods. These were followed by some “tidying up” programs 
which propagate constants, remove common sub-expressions, eliminate double 
negatives, etc. We did not perform fan-in or fan-out correction.

Our most surprising result was that in one of the 15 cases for which comparisons
were run, the size of the NOR-level circuit produced by WEAKOPT was
smaller than that produced by FREEOPT. This must be due to the interaction of 
successive applications of the optimization, and suggests that the effect of signal 
ordering in the sequences of applications is important. Currently, we treat high 
fan-out signals first. 

Other than this case, results were in accordance with our expectations. In 
many cases the two methods were identical. In those where there was a differ- 
ence, there was a wide spread in effect; the improvement varied from 25% to 
100% in the ratio of the number of connections removed by each program. This 
large variance is not too surprising; it suggests that if the style of specification 
is such that the main theorem applies, it may apply frequently. 

2.3 Other Applications 

As mentioned earlier, the refinement in our formalism extends the applicability
of our method to conditional optimizations. This permits us to make use
of “don’t care” conditions directly. We do this by adding to the circuit a new
function which recognizes the appropriate “care” set, and by adding a new pair
setting the output of this function to ’1’ to the partial assignments used for
optimization. The resulting optimizations will be valid on all inputs in the “care” set. Both output
“don’t cares” and redundancy removal can be accommodated in this network.

3. Summary and Conclusion 

In this paper we describe two enhancements to our earlier global-flow algo- 
rithm for connection reduction. We show how to make better use of all types 
of “don’t care” information, and we show how to apply our methods to redun- 
dancy removal optimizations. From a practical standpoint, these enhancements 
result in superior performance for the algorithm. From a conceptual standpoint, 
they unify a wide variety of optimizations which have been part of the Logic 
Synthesis System. 

The work presented suggests a number of avenues for further research: 

1 Develop an incremental or on-line algorithm which maintains the full con- 
trolling sets 

2 Determine the effect of using the forcing sets or the full controlling sets 
in global flow (See [9] for a beginning) 

3 Investigate the effect of signal ordering on global flow. 

Abstract

This paper describes efficient algorithms for decomposition and factorization of Boolean expres- 
sions. The method uses only two-literal single-cube divisors and double-cube divisors considered 
concurrently with their complementary expressions. We demonstrate that these objects, despite 
their simplicity, provide a very good framework to reason about common algebraic divisors and 
the duality relations between expressions. The algorithm has been implemented and excellent 
results on several benchmark circuits illustrate its efficiency and effectiveness. 



1. Introduction 

Multi-level logic synthesis is one of the key problems in automatic synthesis 
of digital circuits [1, 2, 4, 6, 8]. The basic operations involved in multi-level syn- 
thesis are decomposition and factoring. Decomposition [6] of a network refers 
to identifying subexpressions common to one or more functions such that they 
can be implemented once and shared across the entire design. Factoring refers 
to representing a logic function in a minimum form using parentheses with the
least number of literals. The operation of decomposition is similar to factoring
except that each subexpression is formed as a new intermediate variable and
substituted into the functions being decomposed. Both decomposition and
factoring are concerned with the identification of common subexpressions and rewriting
logic functions in factored form. A factored form is a parenthesized algebraic 
expression that has many attractive properties [6]. 

The basic problem in decomposition and factoring is the identification of fre- 
quently used common subexpressions. Sharing of such expressions across the 
entire design reduces the complexity of the synthesized network. Dietmeyer and 
Su [7] proposed a recursive top down factoring algorithm: Given F, the method 
computes common factors which are single-cube divisors. The main disadvantage
of this method is that common factors which consist of more than one cube
(multiple cube divisors) are not considered. 

Brayton and McMullen [5, 6] proposed a bottom up technique for detecting 
common subexpressions. They introduced the notion of kernels in algebraic ex- 
pressions and showed how to use kernels to find multiple cube factors, which 
are common to two or more expressions. These techniques, with minor modi- 
fications, are used in all the systems [1, 2, 4, 6, 8] for global optimization. It 
is believed that through the intersection of sets of kernels, it is possible to find 
common subexpressions that would be just as good a group of candidates for 
divisors, as those obtained from intersecting the complete sets of all algebraic 
divisors. Although the set of kernels of an expression is usually smaller than 
the set of all its algebraic divisors, the number of kernels of an expression can 
grow exponentially in the number of literals in the support of the expression [6]. 
The computation to find the intersections of all kernels is not trivial, since the 
problem of detecting intersections in a set of kernels, and detecting common 
subcubes in a set of functions, are computationally equivalent to finding the ker- 
nels of an expression. Due to these facts, the advantage of finding common 
multiple cubes by the intersection of kernels, is nullified in many big practical 
examples. The algorithm [6] is greedy and selects a kernel with greatest value, 
and the kernel is substituted in the network. However, after one kernel is se- 
lected and substituted, it is possible that the value of the remaining kernels may 
change. The recomputation of kernels after every substitution is very expensive. 
Hence the method uses the set of kernels up to a certain point, where the values
of the kernels become inaccurate. This simplification does not guarantee that the
best kernel is always selected and therefore the method is not entirely greedy. 

Our approach is to identify and extract useful multiple-cube divisors and 
single-cube divisors concurrently with their complements, that together provide 
the greatest cost reduction in terms of the number of literals. In order to make 
the extraction process very simple, and to ensure that the operations are in the 
polynomial time domain, the method during any iteration looks at objects which 
are either multiple-cube divisors having exactly two cubes called double-cube 
divisors or single-cube divisors having exactly two literals. Using objects of 
size two, whose numbers are always in the polynomial domain, algebraic di- 
visors of arbitrary size and their complements are obtained. The relationships 
between double-cube divisors, single-cube divisors, algebraic divisors and their 
complements are studied in this paper. These properties are exploited in designing
an efficient concurrent decomposition algorithm.

Based on the above concepts a program called Pendulum has been imple- 
mented. Excellent results on several benchmark circuits [3, 10] illustrate the 
effectiveness and efficiency of our algorithm in terms of the literal count of the 
synthesized multi-level logic and the CPU time required by the algorithm. 

2. Definitions

The definitions used in this paper are consistent with [5]. A variable is a 
symbol representing a single coordinate of Boolean space. A literal is a variable 
or its negation. A cube is a set C of literals such that $x \in C$ implies $\bar{x} \notin C$.
A cube represents the conjunction of the literals. A sum of products (SOP)
form, also called an expression, is a set of cubes. An expression represents the
disjunction of its cubes. Given two Boolean expressions f and g, g is called an
algebraic divisor of f if $f = gh + r$, where h and r are expressions and h is not
null. Also $sup(g) \cap sup(h) = \emptyset$. If g has exactly one cube, then g is called a
single-cube divisor. If g has more than one cube, then g is called a multiple-cube
divisor. For example, if $f = abc + abd + p$, $g = ab$ is called a single-cube
divisor and $g = bc + bd$ is called a multiple-cube divisor. A Boolean expression
f is cube-free if the only cube dividing f evenly is 1. Note that a cube-free
expression must have more than one cube. For example, $ab + c$ is cube-free
but $ab + ac$ and $abc$ are not. Double-cube divisors of a Boolean expression f are
cube-free, multiple-cube divisors having exactly two cubes.

The set of all double-cube divisors of f is written as D(f), where

$D(f) = \{d \mid d = \{c_i \setminus (c_i \cap c_j),\ c_j \setminus (c_i \cap c_j)\}\}$   (1)

for $i, j = 1, \ldots, n$, $i \neq j$, where n is the number of cubes in f.

For example, let $f = ade + ag + bcde + bcg$. The double-cube divisors are
$de + g$, $a + bc$, $ade + bcg$ and $ag + bcde$. It is evident that the total number of
double-cube divisors for a Boolean function with n cubes is $O(n^2)$. $(c_i \cap c_j)$ is
called the base of cubes $c_i$ and $c_j$ and the double-cube divisor definition allows
double-cube divisors with an empty base.
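
Equation (1) translates almost directly into code. The sketch below is an illustration, not the Pendulum implementation: it assumes every literal is a single character with no complemented literals, and the helper names are hypothetical.

```python
from itertools import combinations


def double_cube_divisors(cubes):
    """Return D(f) for an expression given as an iterable of cubes.

    Each cube is a set of literals. For every pair of cubes the common
    literals (the base) are removed and the two remainders form a divisor.
    """
    divisors = set()
    for ci, cj in combinations([frozenset(c) for c in cubes], 2):
        base = ci & cj
        divisors.add(frozenset({ci - base, cj - base}))
    return divisors


def show(divisor):
    """Render a divisor as text, e.g. 'de + g'."""
    return " + ".join(sorted("".join(sorted(cube)) for cube in divisor))


# f = ade + ag + bcde + bcg, the example from the text.
f = [set("ade"), set("ag"), set("bcde"), set("bcg")]
names = sorted(show(d) for d in double_cube_divisors(f))
```

The six cube pairs of f collapse to four distinct divisors, and the loop over pairs makes the quadratic bound visible.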

3. Properties of double-cube divisors 

As the name indicates, double-cube divisors have exactly two cubes. A subset 
of double-cube divisors is represented by $D_{x,y,s}$, where x is the number of literals
in the first cube, y is the number of literals in the second cube and s is the number
of variables in the support of any double-cube divisor d. Note that
$\max(x,y) \le s \le (x+y)$. For any $d \in D_{x,y,s}$, $d \in D_{y,x,s}$ and hence without loss of generality,
we can assume $x \le y$. For example, a double-cube divisor $xy + yzp \in D_{2,3,4}$.

Let S denote the set of all single-cube divisors. A subset of single-cube divisors
is denoted by $S_k$, where k is the number of literals in the single-cube divisor.
For example, a single-cube divisor $ab \in S_2$.
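
The indices x, y and s can be computed mechanically from a divisor. A minimal sketch, assuming literals are strings and that a trailing apostrophe marks a complemented literal (an encoding chosen here for illustration, not taken from the paper):

```python
def classify(divisor):
    """Return (x, y, s) for a divisor given as two cubes (sets of literals)."""
    c1, c2 = sorted(divisor, key=len)          # enforce x <= y
    support = {lit.rstrip("'") for lit in c1 | c2}
    return len(c1), len(c2), len(support)


example = classify([{"x", "y"}, {"y", "z", "p"}])   # xy + yzp
sum_ab = classify([{"a"}, {"b"}])                   # a + b
xor_ab = classify([{"a'", "b"}, {"a", "b'"}])       # a'b + ab'
```

The three calls reproduce the index triples of the classes the text names for these expressions.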

With this notation, some useful relations between single-cube divisors, double- 
cube divisors and complements among them are given below. We assume, in the 
discussion that follows, that double-cube divisors are extracted from functions 
which are prime and irredundant with respect to every output. For proofs, read- 
ers are referred to [9]. 

Lemma 1. $D_{1,1,1}$ and $D_{1,2,2}$ are null sets.

Lemma 2. For any $d \in D_{1,1,2}$, $\bar{d} \in S_2$.
Example. The complement of $a + b$ is $\bar{a}\bar{b} \in S_2$.

Lemma 3. For any $d \in D_{1,2,3}$, $\bar{d} \notin D$.

Example. $a + bc \in D_{1,2,3}$. Its complement is $\bar{a}\bar{b} + \bar{a}\bar{c}$. Since D is a set of
cube-free divisors, $\bar{a}\bar{b} + \bar{a}\bar{c}$ cannot be in D.

Lemma 4. For any $d \in D_{2,2,2}$, d is either an exclusive-or or exclusive-nor
expression and $\bar{d} \in D_{2,2,2}$.

Example. $\bar{a}b + a\bar{b} \in D_{2,2,2}$. Its complement is $ab + \bar{a}\bar{b} \in D_{2,2,2}$.

Lemma 5. For any $d \in D_{2,2,3}$, $\bar{d} \in D_{2,2,3}$.

Example. $\bar{a}b + ac \in D_{2,2,3}$. Its complement is $\bar{a}\bar{b} + a\bar{c} \in D_{2,2,3}$.

Lemma 6. For any $d \in D_{2,2,4}$, $\bar{d} \notin D$.
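
The complement relations in the examples for Lemmas 2 to 5 can be checked by exhaustive enumeration. The sketch below assumes the examples read as follows, writing x' for the complement of x: (a + b)' = a'b', (a + bc)' = a'b' + a'c', (a'b + ab')' = ab + a'b', and (a'b + ac)' = a'b' + ac'. It only verifies these identities; it does not prove the lemmas in general.

```python
from itertools import product


def same_function(f, g, n):
    """True if f and g agree on all 2**n assignments of n variables."""
    return all(f(*v) == g(*v) for v in product((0, 1), repeat=n))


def inv(x):
    """Complement of a 0/1 value."""
    return 1 - x


checks = {
    "Lemma 2": same_function(lambda a, b: inv(a | b),
                             lambda a, b: inv(a) & inv(b), 2),
    "Lemma 3": same_function(lambda a, b, c: inv(a | (b & c)),
                             lambda a, b, c: (inv(a) & inv(b)) | (inv(a) & inv(c)), 3),
    "Lemma 4": same_function(lambda a, b: inv((inv(a) & b) | (a & inv(b))),
                             lambda a, b: (a & b) | (inv(a) & inv(b)), 2),
    "Lemma 5": same_function(lambda a, b, c: inv((inv(a) & b) | (a & c)),
                             lambda a, b, c: (inv(a) & inv(b)) | (a & inv(c)), 3),
}
```

In the Lemma 3 case both complement cubes share the literal a', which is why that complement is not cube-free.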

Given two Boolean expressions f and g, the important task during decomposition
is to establish whether (a) function f has a complement cube divisor in g
and (b) function f has a common cube-divisor in g. Theorems 1 and 2, below,
establish such relations between two expressions.

Theorem 1 [9]. Let f and g be two Boolean expressions. Then if

■ (a) $d_i \neq \bar{s}_j$ for every $d_i \in D_{1,1,2}(f)$, $s_j \in S_2(g)$, and

■ (b) $d_i \neq \bar{d}_j$ for every $d_i \in D_{exor}(f) \subseteq D_{2,2,2}(f)$, $d_j \in D_{exnor}(g) \subseteq D_{2,2,2}(g)$,
and

■ (c) $d_i \neq \bar{d}_j$ for every $d_i \in D_{exnor}(f) \subseteq D_{2,2,2}(f)$, $d_j \in D_{exor}(g) \subseteq D_{2,2,2}(g)$,
and

■ (d) $d_i \neq \bar{d}_j$ for every $d_i \in D_{2,2,3}(f)$, $d_j \in D_{2,2,3}(g)$, and

■ (e) $d_i \neq \bar{s}_j$ for every $d_i \in D_{1,1,2}(g)$, $s_j \in S_2(f)$,

then f has neither a complement double-cube divisor nor a complement single-cube
divisor in g.

One of the important tasks during the decomposition is to detect whether 
two or more expressions have any common algebraic divisors other than single 
cubes. A theorem was presented in [11, 5, 6] using the concept of intersection
of sets of kernels. This theorem is used for detecting if two or more expressions
have any common algebraic divisors other than single cubes. The method is
to compute the set of kernels for each logic expression, and to form nontrivial
(more than one term) intersections among kernels from different functions. If
this intersection set is empty, it is certain that there are no common multiple
cubes between expressions. This theorem is important, since the set of all kernels
is much smaller than the set of all algebraic divisors. We will now demonstrate
that the same argument is valid for detecting common multiple-cube divisors from
a set of double-cube divisors rather than a set of kernels. This should lead to an
improved run time efficiency, since the set of all double-cube divisors is much 
smaller than the set of all kernels. 

Theorem 2. Expressions f and g have a common multiple-cube divisor
if and only if $D(f) \cap D(g) \neq \emptyset$.

Proof. If: If $D(f) \cap D(g) \neq \emptyset$, then there exists a $d \in D(f) \cap D(g)$ which is
a double-cube divisor that divides both f and g.

Only if: Now suppose that f and g have a common multiple-cube divisor
C, C|f, C|g. (C|f denotes C divides f.) Let $C = \{c_1, c_2, \ldots, c_m\}$. Take any
$e = \{c_i, c_j\}$ such that $c_i, c_j \in C$. If e is cube-free, then $e \in D(f) \cap D(g)$. If e
is not cube-free, then $e' = \{c_i \setminus (c_i \cap c_j),\ c_j \setminus (c_i \cap c_j)\}$ exists since f and g are
non-redundant. Hence $e' \in D(f) \cap D(g)$. Q.E.D.
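
Theorem 2 reduces the search for a common multiple-cube divisor to a set intersection. A minimal sketch of that test, assuming single-character uncomplemented literals and hypothetical example expressions (this is an illustration, not the procedure used in Pendulum):

```python
from itertools import combinations


def double_cube_divisors(cubes):
    """D(f): strip the common base from every pair of cubes."""
    cubes = [frozenset(c) for c in cubes]
    result = set()
    for ci, cj in combinations(cubes, 2):
        base = ci & cj
        result.add(frozenset({ci - base, cj - base}))
    return result


def share_multiple_cube_divisor(f, g):
    """Theorem 2 as a test: is D(f) ∩ D(g) non-empty?"""
    return bool(double_cube_divisors(f) & double_cube_divisors(g))


f = [set("ade"), set("ag")]   # f = ade + ag = a(de + g)
g = [set("bde"), set("bg")]   # g = bde + bg = b(de + g)
h = [set("ab"), set("cd")]    # h = ab + cd

f_and_g = share_multiple_cube_divisor(f, g)
f_and_h = share_multiple_cube_divisor(f, h)
```

Here f and g share the divisor de + g, while f and h have no double-cube divisor in common.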

4. Concurrent Decomposition Procedure 

The basic objects used in the decomposition of Boolean equations are single- 
cube divisors having exactly two literals, double-cube divisors, and their com- 
plements. These two objects help us to find single-cube divisors of arbitrary 
sizes, and multiple-cube divisors. With each of the generated double-cube divi- 
sors and single-cube divisors, a weight function is associated which indicates the 
number of literals that can be saved, if this divisor is selected. The complement 
of the double-cube divisor is also considered in the weight computation. The 
concurrent, greedy decomposition method is given in Algorithm 1 [9], which 
identifies and extracts either a double-cube divisor jointly with its dual expres- 
sion, or a single-cube divisor that provides the greatest cost reduction in terms 
of the total number of literals. 

5. Experimental Results 

Based on the above proposed algorithm a program called Pendulum has been 
implemented in C under UNIX. The largest PLAs from [10, 3] were selected as 
benchmark circuits to measure the performance of the algorithm. All the PLA’s 
were minimized using the logic minimizer ESPRESSO [3] before decomposition.
The resulting synthesized circuits were verified using the verify option of
MISII [6]. Columns 2 and 3 of Table 1 summarize the results obtained using
Pendulum and compare them to those generated with MISII (both run on the same
SUN 3/260) using the script given in Figure 1.

time
sweep
eliminate 5
simplify -m nocomp -d
resub -a
gkx
resub -a; sweep
gcx
resub -a; sweep
gkx
resub -a; sweep
gcx
resub -a; sweep
gkx
resub -a; sweep
gcx
resub -a; sweep
eliminate 0
decomp -g *
print-stats -f
time



Figure 1. 

Note that only “gkx” and “gcx” commands are used to run MIS without any 
expensive options. These scripts ensure that only level-0 kernels [6] are used 
and the more efficient “ping-pong” algorithm [11] is used to find a good (but not
necessarily the best) kernel intersection and single-cube divisor. These options, 
in MIS, take less CPU time compared to the standard script which finds all the 
kernels and then chooses the best kernel. The effectiveness of the method can be 
summarized by the total literal count and the total time taken to synthesize the 
circuits. These numbers are 8,871 literals and 6,076 CPU seconds for Pendulum 
and 11,488 literals and 53,648 CPU seconds for MISII. For these benchmarks,
Pendulum synthesized circuits with, on average, 20 percent fewer literals,
using one ninth of the CPU time required by MISII. The results for MIS2.1
were reported in the MCNC 1989 workshop poster session [10] and are given 
in the last two columns of Table 1. Single is the factored form literal count us- 
ing a single execution of a standard script. Best is the best factored form literal 
count using several executions of a standard script. It can be seen that for large 
benchmark circuits like apexl, apexS, apex4, apexS and seq Pendulum produces 
smaller networks than that of the best results obtained by MIS and for 9sym, 
          Final literals in SOP    CPU seconds (SUN 3/260)    Literal count in factored form
pla       Pendulum   MIS2.1*       Pendulum   MIS2.1*         Pendulum   MIS2.1 Single**   MIS2.1 Best**
5xp1      163        186           1.7        32.0            150        114               104
9sym      263        255           95.5       484.7           225        229               216
add6      85         104           220        1087            83         -                 -
apex1     1101       1268          75         990             1065       1247              1247
apex2     452        540           4498       37509           426        278               246
apex3     1381       1876          184.5      2410            1380       1450              1401
apex4     1976       2594          346.9      4512            1924       2592              2592
apex5     893        1112          183        507             849        908               890
chkn      469        452           184.2      289             408        -                 -
duke2     435        493           13.0       82.5            420        442               393
e64       253        253           2.7        140.3           253        253               253
rd73      106        117           11.8       294.8           98         90                74
rd84      143        446           136.6      784.6           135        148               124
sao2      171        213           4.9        40              149        122               118
seq       948        1547          63.8       2059            896        1176              1176
xor9      32         32            55         2427            32         -                 -
TOTAL     8871       11488         6076.6     53648.9

*  Results obtained by using the script given in Figure 1
** Results reported in MCNC '89 workshop poster session
-  Results not available

Table 1. Summary of experimental results.



duke2, e64 and rd84 Pendulum produces smaller networks than a single
execution of MIS. The effectiveness of our approach is due to the fact that
Pendulum works with divisors and their complements, which lie in the
polynomial-time domain. These divisors are extracted only once and updated
incrementally as the synthesis proceeds. The duality relation between various
intermediate nodes is also determined immediately during the synthesis. The
advantages of the method are its speed over the MIS approach and its ability to
compute the weights of potential divisors more accurately and dynamically during
decomposition and factorization. Either single-cubes having two literals or
double-cube divisors are selected at each step, which provides an added advantage
over performing the “gcx” and “gkx” commands separately as done in MIS. As the cube
divisors are extracted concurrently based on the cost reduction they yield, no
resubstitution or simplification is required.
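The aggregate figures quoted from Table 1 can be cross-checked with a few lines of arithmetic. The sketch below uses only the TOTAL row; the variable names are illustrative, and the 20 percent average quoted in the text may have been computed per circuit rather than from these totals, so a small difference is to be expected.

```python
# Totals from the TOTAL row of Table 1 (Pendulum vs. MISII run with the Figure 1 script).
pendulum_literals, mis_literals = 8871, 11488
pendulum_cpu, mis_cpu = 6076.6, 53648.9

# Aggregate literal reduction, as a percentage of the MISII total.
literal_reduction = 100 * (1 - pendulum_literals / mis_literals)

# How many times more CPU time MISII needed over the whole benchmark set.
cpu_ratio = mis_cpu / pendulum_cpu

print(f"aggregate literal reduction: {literal_reduction:.1f}%")
print(f"aggregate CPU ratio (MISII / Pendulum): {cpu_ratio:.1f}x")
```

On the totals alone the literal reduction comes out slightly above 20 percent and the CPU ratio slightly below nine, which is consistent with the rounded figures given in the text.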

6. Conclusions 

We presented a new, very efficient method for decomposition and factorization
of Boolean expressions. The proposed method uses only two-literal single-cube
divisors or double-cube divisors concurrently, with consideration for the use
of complements. The advantages of this approach are its speed over the methods
based on kernels [5] and its ability to compute the weights of potential divisors
more accurately and dynamically during decomposition and factorization. Either
single-cubes having two literals or double-cube divisors are selected at each
step, which provides an added advantage over extracting multiple-cube divisors
and single-cube divisors separately as done in [6]. In fact, the results obtained by
this method match the best known literal counts for most benchmark circuits, and in
many cases the method generates Boolean networks with a much smaller number
of literals. In another paper [9], we demonstrate, both theoretically and
experimentally, that the decomposition and factorization transformations introduced in
this paper preserve testability, which implies that a complete test set developed
for an input network also gives complete coverage of faults in the synthesized
multi-level network.
Abstract

In this paper we present a polynomial time technology mapping algorithm, called Flow-Map,
that optimally solves the LUT-based FPGA technology mapping problem for depth minimization
for general Boolean networks. This theoretical breakthrough contrasts sharply with the fact
that the conventional technology mapping problem in library-based designs is NP-hard. A key step in
Flow-Map is to compute a minimum height K-feasible cut in a network, solved by network flow
computation. Our algorithm also effectively minimizes the number of LUTs by maximizing the
volume of each cut and by several postprocessing operations. We tested the Flow-Map algorithm
on a set of benchmarks and achieved reductions in both the network depth and the number of
LUTs in mapping solutions as compared with previous algorithms.



1. Introduction 

The short design cycle and low manufacturing cost have made the FPGA an
important technology for VLSI ASIC designs. The LUT-based FPGA is a popular
architecture used by several FPGA manufacturers, including Xilinx and AT&T
[1, 2]. In an LUT-based FPGA chip, the basic programmable logic block is a
K-input lookup table (K-LUT) which can implement any Boolean function of up
to K variables. The technology mapping problem in LUT-based FPGA designs
is to transform a general Boolean network (obtained by technology independent
synthesis) into a functionally equivalent network of K-LUTs. This paper studies
the LUT-based FPGA technology mapping problem for delay optimization.

The previous LUT-based FPGA mapping algorithms can be roughly divided
into three classes. The algorithms in the first class emphasize minimizing the
number of LUTs in the mapping solutions [3, 4, 5, 6, 7, 8]. The algorithms in
the second class emphasize minimizing the delay of the mapping solutions
[9, 10, 11]. The algorithms in the third class maximize the routability of the
mapping solutions [12, 13]. Although many of the existing mapping methods
showed encouraging results, these methods are heuristic in nature, and there is
no way to determine how far the mapping solutions of these algorithms
are from the optimal solution in terms of the number of LUTs or the depth of
the LUT network.

This paper presents a theoretical breakthrough which shows that the LUT- 
based FPGA technology mapping problem for depth minimization can be solved 
optimally in polynomial time for general Boolean networks. A key step in our 
algorithm is to compute a minimum height K-feasible cut in a network, which 
is solved optimally in polynomial time based on efficient network flow com- 
putation. Our result makes a sharp contrast with the fact that the conventional 
technology mapping problem in library-based designs is NP-hard for general 
Boolean networks [14, 15]. Due to this inherent difficulty, most conventional 
technology mapping algorithms decompose the input network into a forest of 
trees and then map each tree optimally [14, 15]. Such a methodology was also 
used in some existing FPGA mapping algorithms [3, 16]. However, our result 
shows that optimal solutions can be produced efficiently for general Boolean 
networks in LUT-based FPGA technology mapping for depth minimization. 

2. Problem Formulation and Preliminaries 

A Boolean network can be represented as a directed acyclic graph (DAG)
where each node represents a logic gate, and a directed edge $(i, j)$ exists if the
output of gate $i$ is an input of gate $j$. A primary input (PI) node has no incoming
edge and a primary output (PO) node has no outgoing edge. We use $input(v)$ to
denote the set of fanins of gate $v$. Given a subgraph $H$ of the Boolean network,
$input(H)$ denotes the set of distinct nodes outside $H$ which supply inputs to the
gates in $H$. For a node $v$ in the network, a K-feasible cone at $v$, denoted $C_v$, is
a subgraph consisting of $v$ and its predecessors such that any path connecting a
node in $C_v$ and $v$ lies entirely in $C_v$, and $|input(C_v)| \le K$. The level of a node $v$
is the length of the longest path from any PI node to $v$. The level of a PI node is
zero. The depth of a network is the largest node level in the network. A Boolean
network is K-bounded if $|input(v)| \le K$ for every node $v$.
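The definitions of level, depth and K-boundedness translate directly into code. The following is a minimal sketch, assuming the network is stored as a dictionary `fanins` that maps each node to the list of its fanin nodes (PIs map to an empty list); the function names and the small example network are illustrative and do not come from the paper.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def node_levels(fanins):
    """Return the level of every node: 0 for PIs, else 1 + max fanin level."""
    levels = {}
    # static_order() yields each node only after all of its fanins.
    for v in TopologicalSorter(fanins).static_order():
        ins = fanins.get(v, [])
        levels[v] = 0 if not ins else 1 + max(levels[u] for u in ins)
    return levels


def network_depth(fanins):
    """The depth of a network is the largest node level."""
    return max(node_levels(fanins).values())


def is_k_bounded(fanins, k):
    """A network is K-bounded if every node has at most K fanins."""
    return all(len(ins) <= k for ins in fanins.values())


# Hypothetical example: a, b, c are PIs and "out" is the only PO.
example = {
    "a": [], "b": [], "c": [],
    "g1": ["a", "b"],
    "g2": ["g1", "c"],
    "out": ["g2", "a"],
}

print(node_levels(example))
print(network_depth(example), is_k_bounded(example, 2))
```

Because levels are computed in topological order, each node is visited once and each edge is inspected once, so the computation is linear in the size of the network.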

We assume that each programmable logic block in an FPGA is a K-LUT
that can implement any K-input Boolean function. Thus, each K-LUT can
implement any K-feasible cone of a Boolean network. The technology mapping
problem for K-LUT based FPGAs is to cover a given Boolean network with
K-feasible cones. (Note that we allow these cones to overlap, which means
that certain nodes in the original network can be duplicated when generating
K-LUTs.) A technology mapping solution $S$ is a DAG where each node is a
K-feasible cone (equivalently, a K-LUT) and the edge $(C_u, C_v)$ exists if $u$ is in
$input(C_v)$. Our main objective is to compute a mapping solution that results in
the minimum delay. The unit delay model is used, where the delay is determined
by the depth of the mapping solution. We say that a mapping solution is optimal
if its depth is minimum. The main objective of our algorithm is to find an
optimal mapping solution; the secondary objective is to reduce the number of
K-LUTs used in the solution.

Several important concepts about cuts in a network will be used in this paper.
Given a network $N = (V(N), E(N))$ with a source $s$ and a sink $t$, a cut $(X, \overline{X})$
is a partition of the nodes in $V(N)$ such that $s \in X$ and $t \in \overline{X}$. The node cut-size
of $(X, \overline{X})$, denoted $n(X, \overline{X})$, is the number of nodes in $X$ that are adjacent to
some node in $\overline{X}$, i.e.

$$n(X, \overline{X}) = |\{x : (x, y) \in E(N),\ x \in X \text{ and } y \in \overline{X}\}|.$$

A cut $(X, \overline{X})$ is K-feasible if its node cut-size is no more than $K$, i.e., $n(X, \overline{X}) \le K$.
Assume that each edge $(u, v)$ has a non-negative capacity $c(u, v)$. Then, the
edge cut-size of $(X, \overline{X})$, denoted $e(X, \overline{X})$, is the sum of the capacities of the edges
that cross the cut, i.e.

$$e(X, \overline{X}) = \sum_{u \in X,\ v \in \overline{X}} c(u, v).$$

Throughout this paper, we assume that the capacity of each edge is one unless
specified explicitly. The volume of a cut $(X, \overline{X})$, denoted $vol(X, \overline{X})$, is the number
of nodes in $\overline{X}$, i.e., $vol(X, \overline{X}) = |\overline{X}|$. Moreover, assume that there is a given
label $l(v)$ associated with each node $v$. Then, the height of a cut $(X, \overline{X})$, denoted
$h(X, \overline{X})$, is defined to be the maximum label in $X$, i.e.

$$h(X, \overline{X}) = \max\{l(x) : x \in X\}.$$

Figure 1 shows a cut $(X, \overline{X})$ in a network with given node labels, where $n(X, \overline{X}) = 3$,
$e(X, \overline{X}) = 10$, $h(X, \overline{X}) = 2$, and $vol(X, \overline{X}) = 9$.
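The four cut measures defined above (node cut-size, edge cut-size, volume and height) can be computed directly from an edge list. The sketch below is illustrative only: the function name `cut_measures`, the small example network and its labels are assumptions made for the example, and the source node is simply given label 0 so that the height is defined over all of the source side.

```python
def cut_measures(edges, label, X, capacity=None):
    """Return (node cut-size, edge cut-size, volume, height) of the cut (X, X-bar).

    edges:    iterable of directed edges (u, v)
    label:    dict giving the label l(x) of every node in X
    X:        the source side of the cut; all other nodes form X-bar
    capacity: optional dict (u, v) -> capacity; missing edges default to 1
    """
    edges = list(edges)
    nodes = {u for e in edges for u in e}
    X = set(X)
    Xbar = nodes - X
    crossing = [(u, v) for (u, v) in edges if u in X and v in Xbar]
    node_cut = len({u for (u, _) in crossing})   # nodes of X adjacent to X-bar
    edge_cut = sum((capacity or {}).get(e, 1) for e in crossing)
    volume = len(Xbar)                           # vol = |X-bar|
    height = max(label[x] for x in X)            # h = max label in X
    return node_cut, edge_cut, volume, height


# Hypothetical example network with source "s" and sink "t".
edges = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"), ("c", "t"), ("a", "t")]
label = {"s": 0, "a": 0, "b": 0, "c": 1}

print(cut_measures(edges, label, {"s", "a", "b"}))
print(cut_measures(edges, label, {"s", "a", "b", "c"}))
```

Moving node c from the sink side to the source side lowers the volume and raises the height, which is the trade-off the labeling phase reasons about.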

3. An Optimal LUT-Based FPGA Mapping 
Algorithm for Depth Minimization 

Our algorithm is applicable to any K-bounded Boolean network. Given a
general Boolean network as input, if it is not K-bounded, there are a number of
ways to transform it into a K-bounded network. For example, the Roth-Karp
decomposition [17] was used in [5] to obtain a K-bounded network. In our
system, we first transform the given Boolean network into a network of simple
gates (i.e., AND, OR, NAND, and NOR gates): we represent each complex gate
in the sum-of-products form and then replace it with two levels of simple gates.
Then, we transform the resulting Boolean network into a two-input Boolean
network. There are two reasons for carrying out such a transformation. First, we
want to limit the number of inputs of each gate to be no more than K so that
we do not have to decompose gates during technology mapping. Second, if we
think of FPGA technology mapping as a process of packing gates in a given
network into K-LUTs, then, intuitively, smaller gates will be more easily packed,
with less wasted space in each K-LUT. As proposed in [9, 18, 19], we use an
algorithm based on the Huffman coding tree construction to decompose each
multiple-input simple gate into a tree of two-input simple gates. According to
the result in [9], such a decomposition procedure increases the network depth
by at most a small constant factor. Although our system transforms the original
network into a network of two-input simple gates, the optimality of our
algorithm does not depend on the fact that each node in the Boolean network is a
two-input simple gate. The optimality of our mapping result holds as long as
the input network is K-bounded; the gates need not be simple.
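The Huffman-style decomposition mentioned above can be sketched as follows. This is one plausible reading of the idea rather than the procedure of [9, 18, 19]: the two inputs with the smallest levels are repeatedly combined into a new two-input gate whose level is one more than the larger of the two. The function name, the generated gate names `n0`, `n1`, ... and the example levels are hypothetical, and the sketch assumes an associative gate type such as AND or OR with at least one input.

```python
import heapq
from itertools import count


def decompose_gate(input_levels):
    """Decompose one multi-input gate into a tree of two-input gates.

    input_levels: list of (input_name, level) pairs for the gate's fanins.
    Returns (root_name, root_level, gates), where gates is a list of
    (new_gate_name, left_input, right_input) in creation order.
    """
    tie = count()  # insertion counter: keeps heap entries totally ordered
    heap = [(lvl, next(tie), name) for name, lvl in input_levels]
    heapq.heapify(heap)
    gates = []
    while len(heap) > 1:
        # Combine the two inputs that arrive earliest (smallest levels).
        l1, _, a = heapq.heappop(heap)
        l2, _, b = heapq.heappop(heap)
        g = f"n{len(gates)}"
        gates.append((g, a, b))
        heapq.heappush(heap, (max(l1, l2) + 1, next(tie), g))
    lvl, _, root = heap[0]
    return root, lvl, gates


# Hypothetical 4-input gate whose inputs sit at levels 0, 0, 1 and 3.
root, level, gates = decompose_gate([("a", 0), ("b", 0), ("c", 1), ("d", 3)])
print(root, level, gates)
```

In this example the late-arriving input d is merged last, so the root ends up at level 4; pairing the inputs as (a, b) and (c, d) regardless of their levels would place the root at level 5.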

Our optimal mapping algorithm, named Flow-Map, runs in two phases. In
the first phase, it computes a label for each node which reflects the level of
the K-LUT implementing that node in an optimal mapping solution. In the
second phase, it generates the K-LUT mapping solution based on the node labels
computed in the first phase. Due to the length restriction of the paper, the results
in this section and the next section are stated without proof. The proofs of these
results can be found in [20].

3.1 The Labeling Phase 

Given a K-bounded Boolean network $N$, let $N_v$ denote the subnetwork
consisting of node $v$ and all the predecessors of $v$. We define the label of $v$, denoted
$l(v)$, to be the depth of the optimal K-LUT mapping solution of $N_v$. Clearly,
the level of the K-LUT containing $v$ in the optimal mapping solution of $N$ is at
least $l(v)$, and the maximum label of all the POs of $N$ is the depth of the optimal
mapping solution of $N$. The first phase of our algorithm computes the labels
of all the nodes in $N$ in topological order, starting from the PIs.
For each PI node $v$, we assign $l(v) = 0$. Suppose $t$ is the current node being
processed. Then, for each node $v$ other than $t$ in $N_t$, the label $l(v)$ has been computed.
By including in $N_t$ an auxiliary node $s$ and connecting $s$ to all the PI nodes in
$N_t$, we obtain a network with $s$ as the source and $t$ as the sink. For simplicity we
still denote it as $N_t$. Figure 2(a) shows part of a Boolean network in which gate
$t$ is being labeled, and Figure 2(b) shows the construction of the network $N_t$.
Let $LUT(t)$ be the K-LUT that implements node $t$ in an optimal K-LUT mapping
solution of $N_t$, and let $\overline{X}$ denote the set of nodes in $LUT(t)$ and $X$ denote
the remaining nodes in $N_t$. It is easy to see that $(X, \overline{X})$ forms a K-feasible cut
between $s$ and $t$ in $N_t$ because the number of inputs of $LUT(t)$ is no more than
$K$. Moreover, let $u$ be the node with the maximum label in $X$; then the level of
$LUT(t)$ is $l(u) + 1$ in the optimal mapping solution of $N_t$. Recalling the definition
of the height of a cut in Section 2, we have $h(X, \overline{X}) = l(u)$. Therefore, in order
to minimize the level of $LUT(t)$ in the mapping solution of $N_t$, we want to find a
minimum height K-feasible cut $(X, \overline{X})$ in $N_t$. (We exclude the cuts $(X, \overline{X})$ where
$\overline{X}$ contains a PI node. Our algorithm, shown later, guarantees that such
cuts are not generated.) In other words,

$$l(t) = \min_{(X, \overline{X})\ \text{is K-feasible}} h(X, \overline{X}) + 1.$$

Figures 2(b) and 2(c) illustrate our labeling method. Since in 2(b) we have a
minimum height 3-feasible cut in $N_t$ whose height is 1, $t$ is labeled 2, and the
optimal K-LUT mapping solution of $N_t$ is shown in Figure 2(c).

There was no known polynomial time algorithm for computing a minimum
height K-feasible cut. One important contribution of our work is that we have
developed an $O(Km)$ time algorithm for computing a minimum height K-feasible
cut in $N_t$, where $m$ is the number of edges in $N_t$.

First, we show that the node labels defined by our labeling scheme satisfy the 
following property. 

Lemma 1 The label of $t$ satisfies $l(t) = p$ or $l(t) = p + 1$, where $p$ is the
maximum label of the nodes in $input(t)$.

According to Lemma 1, our algorithm first checks if there is a K-feasible cut
$(X_t, \overline{X}_t)$ of height $p - 1$ in $N_t$. If there is such a cut, we assign $l(t) = p$ and node
$t$ can be packed with the nodes in $\overline{X}_t$ into the same K-LUT in the second phase
of our algorithm. Otherwise, the minimum height of the K-feasible cuts in $N_t$ is
$p$ and $(V(N_t) - \{t\}, \{t\})$ is such a cut. In this case, we assign $l(t) = p + 1$ and
we shall use a new K-LUT for node $t$.
Figure 2. Computing the label $l(t)$ of node $t$ ($K = 3$). (a) The partial network; (b) Construction
of $N_t$ and the highest 3-feasible cut; (c) Determining $l(t)$.



Whether $N_t$ has a K-feasible cut of height $p - 1$ can be tested efficiently
using the following method. Let $p$ be the maximum label of the nodes in $N_t$. We
first apply a network transformation on $N_t$ that collapses all the nodes in $N_t$ with
label $\ge p$, together with $t$, into the new sink $t'$. Let $N'_t$ be the resulting network;
we have the following result.

Lemma 2 $N_t$ has a K-feasible cut of height $p - 1$ if and only if $N'_t$ has a
K-feasible cut.

For example, Figure 3(a) shows the network $N_t$ for node $t$ in Figure 2(a), and
Figure 3(b) shows the induced network $N'_t$.

In order to determine if $N'_t$ has a K-feasible cut, we apply another network
transformation, which reduces the node cut-size constraint to an edge cut-size
constraint by splitting nodes into edges. Specifically, we construct a new
network $N''_t$ from $N'_t$ as follows. For each node $v$ in $N'_t$ other than $s$ and $t'$, we
introduce two nodes $v_1$ and $v_2$ and connect them by an edge $(v_1, v_2)$ in $N''_t$, which
is called a bridging edge. The source $s$ and sink $t'$ are also included in $N''_t$ (with
$t'$ renamed as $t''$). For each edge $(s, v)$ in $N'_t$, there is an edge $(s, v_1)$ in $N''_t$, and
for each edge $(v, t')$ in $N'_t$ there is an edge $(v_2, t'')$ in $N''_t$. Moreover, for each
edge $(u, v)$ in $N'_t$ ($u \ne s$ and $v \ne t'$), we introduce an edge $(u_2, v_1)$ in $N''_t$. We
assign the capacity of each bridging edge to be one, and the capacity of each
non-bridging edge to be infinity. Figure 3(c) shows the resulting $N''_t$ obtained
from $N'_t$ in Figure 3(b). Regarding this transformation, we have the following
fact:
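The node-splitting construction described above can be written down in a few lines. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name `split_nodes` is hypothetical, a split node v is represented by the pairs `(v, 1)` and `(v, 2)`, and the sink keeps its original name instead of being renamed.

```python
INF = float("inf")


def split_nodes(edges, source, sink):
    """Build the edge-capacity map of the split network from the edges of N'.

    Every node v other than source and sink becomes a bridging edge
    (v, 1) -> (v, 2) of capacity 1; every original edge becomes a
    non-bridging edge of infinite capacity.
    """
    cap = {}
    inner = {u for e in edges for u in e} - {source, sink}
    for v in inner:
        cap[((v, 1), (v, 2))] = 1                    # bridging edge
    for u, v in edges:
        tail = source if u == source else (u, 2)     # leave from the "out" half
        head = sink if v == sink else (v, 1)         # enter at the "in" half
        cap[(tail, head)] = INF                      # non-bridging edge
    return cap


# Hypothetical collapsed network N' with source "s" and sink "t".
edges = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"), ("c", "t")]
cap = split_nodes(edges, "s", "t")
for edge, c in sorted(cap.items(), key=str):
    print(edge, c)
```

Because only bridging edges have finite capacity, any finite-capacity cut in the split network consists of bridging edges alone, and each such edge corresponds to exactly one node of the original network.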
Figure 3. Network transformations in computing a minimum height 3-feasible cut in $N_t$.



Lemma 3 $N'_t$ has a K-feasible cut if and only if $N''_t$ has a cut whose edge cut-size
is no more than $K$.

According to the Max-flow Min-cut Theorem [21], $N''_t$ has a cut whose edge
cut-size is no more than $K$ if and only if the maximum flow between $s$ and $t''$ in
$N''_t$ has a value of $K$ or smaller. We apply the augmenting path algorithm in $N''_t$
to compute a maximum flow. Since each bridging edge in $N''_t$ has unit capacity,
each augmenting path in the flow residual graph of $N''_t$ from $s$ to $t''$ increases the
flow by one unit. If we can find $K + 1$ augmenting paths, then $N''_t$ has a maximum
flow of value larger than $K$ and we can conclude that $N''_t$ does not have a cut
$(X'', \overline{X''})$ with $e(X'', \overline{X''}) \le K$. Otherwise, the residual graph is disconnected
before we find the $(K + 1)$-th augmenting path, and the disconnected residual
graph induces a cut of edge cut-size no more than $K$. Moreover, we can find
such a cut $(X'', \overline{X''})$ by performing a depth first search starting at the source $s$,
and including in $X''$ all the nodes which are reachable from $s$. Since finding
an augmenting path takes $O(m)$ time, where $m$ is the number of edges in the
residual graph of $N''_t$ (which is of the same order as the number of edges in $N_t$),
we can determine in $O(Km)$ time whether $N''_t$ has a cut of edge cut-size no more
than $K$ and find one if such a cut exists. Such a cut $(X'', \overline{X''})$ in $N''_t$ induces a
K-feasible cut $(X', \overline{X'})$ in $N'_t$, which in turn gives a minimum height K-feasible
cut $(X, \overline{X})$ in $N_t$. (It is clear that for the resulting cut $(X, \overline{X})$ in $N_t$, $\overline{X}$ does not
contain any PI nodes since any outgoing edge of the source $s$ in $N''_t$ has infinite
capacity.)
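The bounded max-flow test described above can be sketched as follows. This is a minimal illustration, assuming the (already collapsed) network is given as an edge list with a named source and sink; the function name `k_feasible_cut` and the example network are hypothetical. The sketch splits nodes internally, pushes at most K + 1 unit augmenting paths found by depth first search, and, if the residual graph becomes disconnected first, returns the sink side of the induced cut.

```python
from collections import defaultdict


def k_feasible_cut(edges, source, sink, k):
    """Return the sink side X-bar of a K-feasible cut, or None if none exists."""
    inf = float("inf")
    res = defaultdict(dict)                      # residual capacities

    def add(u, v, c):
        res[u][v] = res[u].get(v, 0) + c
        res[v].setdefault(u, 0)                  # reverse residual edge

    inner = {u for e in edges for u in e} - {source, sink}
    for v in inner:
        add((v, "in"), (v, "out"), 1)            # bridging edge, capacity 1
    for u, v in edges:
        add(source if u == source else (u, "out"),
            sink if v == sink else (v, "in"), inf)

    def search():
        """Depth first search in the residual graph; returns the parent map."""
        parent = {source: None}
        stack = [source]
        while stack:
            u = stack.pop()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    stack.append(v)
        return parent

    for _ in range(k + 1):
        parent = search()
        if sink not in parent:
            # Residual graph is disconnected: nodes whose "in" half is
            # reachable from the source form X, the rest form X-bar.
            x = {v for v in inner if (v, "in") in parent}
            return (inner - x) | {sink}
        v = sink
        while parent[v] is not None:             # push one unit of flow
            u = parent[v]
            res[u][v] -= 1
            res[v][u] += 1
            v = u
    return None                                  # flow exceeds K


# Hypothetical network: two source-fed nodes feed two gates that feed the sink.
edges = [("s", "a"), ("s", "b"), ("a", "g1"), ("b", "g1"),
         ("a", "g2"), ("b", "g2"), ("g1", "t"), ("g2", "t")]
print(k_feasible_cut(edges, "s", "t", 2))
print(k_feasible_cut(edges, "s", "t", 1))
```

Each search is linear in the number of residual edges and at most K + 1 searches are made, which matches the $O(Km)$ bound stated in the text. Since the returned cut is obtained from the nodes reachable from the source, its source side is as small as possible among minimum cuts.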

Based on the above discussions, we have 
Theorem 1 A minimum height K-feasible cut in $N_t$ can be found in $O(Km)$ time,
where $m$ is the number of edges in $N_t$.

Applying Theorem 1 to each node in network $N$ in our labeling algorithm, we
have

Corollary 1 The labels of all the nodes in $N$ can be computed in $O(Kmn)$ time,
where $n$ and $m$ are the number of nodes and edges in $N$, respectively.

In the current LUT-based FPGA architecture, the typical value of $K$ is 4 or
5. Moreover, if the number of fanins (or the fanouts) of each node in $N$ is
bounded by a constant (which is two in our implementation), we have $m = O(n)$.
Therefore, the complexity of the labeling phase of our algorithm is $O(n^2)$ in
practice.

In fact, the result in Theorem 1 can be generalized to compute the minimum
height K-feasible cut in a general network with arbitrary node labels. Details
can be found in [20].

3.2 The Mapping Phase 

The second phase of our algorithm generates the K-LUTs in the optimal
mapping solution. Let $L$ be the set of outputs which are to be implemented using
K-LUTs. Initially, $L$ contains all the PO nodes. We process the nodes in $L$ one by
one. For each non-PI node $v$ in $L$, assume that $(X_v, \overline{X}_v)$ is the minimum height
K-feasible cut in $N_v$ that we computed in the first phase by the labeling algorithm.
We generate a K-LUT $v'$ to implement the function of gate $v$, using the input
signals from $X_v$ to $\overline{X}_v$. That is, the K-LUT $v'$ includes all the gates in $\overline{X}_v$ and
$input(v') = input(\overline{X}_v)$. (Since the cut is K-feasible, we have $|input(\overline{X}_v)| \le K$.)
Then, we update the set $L$ to be $(L - \{v\}) \cup input(v')$. It is possible that a gate $w$
belongs to both $\overline{X}_v$ and $\overline{X}_u$ for two different gates $v$ and $u$ in $L$. In this case, gate
$w$ is automatically replicated and is included in both $v'$ and $u'$. It is also possible
that no K-LUT is generated for a gate $w$ since it has been completely covered
by the K-LUTs generated for some of its successors. In general, a K-LUT has
to be generated for a gate $w$ if $w$ belongs to $input(v')$ of some K-LUT $v'$ which
has been generated.

The second phase ends when $L$ consists of only PI nodes of the original
network. It is clear that at the end of the execution we get a network of K-LUTs
which is logically equivalent to the original network.
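The mapping phase is essentially a worklist traversal from the POs back to the PIs. The sketch below assumes the cuts from the labeling phase are already available as a dictionary `xbar` giving, for each gate, the set of gates on the sink side of its cut (including the gate itself); the function name, the data layout and the four-input example are illustrative assumptions rather than the authors' code.

```python
def generate_luts(primary_outputs, fanins, xbar):
    """Return {root gate: (gates packed in its LUT, LUT input signals)}."""
    luts = {}
    pending = list(primary_outputs)          # the set L of the text
    while pending:
        v = pending.pop()
        if not fanins.get(v) or v in luts:   # PI, or LUT already generated
            continue
        members = xbar[v]
        # input(v') = nodes outside the LUT that feed some gate inside it.
        inputs = {u for w in members for u in fanins[w]} - members
        luts[v] = (members, inputs)
        pending.extend(inputs)               # the LUT inputs must be implemented too
    return luts


# Hypothetical chain g1 = f(a, b), g2 = f(g1, c), g3 = f(g2, d), with K = 3.
fanins = {"a": [], "b": [], "c": [], "d": [],
          "g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g2", "d"]}
# Sink sides of the cuts, as a labeling phase with K = 3 could produce them.
xbar = {"g1": {"g1"}, "g2": {"g1", "g2"}, "g3": {"g2", "g3"}}

for root, (members, inputs) in generate_luts(["g3"], fanins, xbar).items():
    print(root, sorted(members), sorted(inputs))
```

In this example no LUT is generated for g2, because it is completely covered by the LUT rooted at g3, while g1 receives its own LUT because it is an input of that LUT.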

Combining the first and second phases, we have 

Theorem 2 For any K-bounded Boolean network $N$, the Flow-Map algorithm
produces a K-LUT mapping solution with the minimum depth in $O(Kmn)$ time,
where $n$ and $m$ are the number of nodes and edges in $N$.

In our implementation, $K = 5$ and $m = O(n)$, so the complexity of the
Flow-Map algorithm is $O(n^2)$ in practice.

4. Enhancement of the Flow-Map Algorithm for 
Area Optimization 

The secondary objective of our technology mapping algorithm is to minimize
the number of K-LUTs in the mapping solution. In Flow-Map, this is addressed
by maximizing the volume of each cut during the mapping process and by
postprocessing operations for K-LUT reduction.

4.1 Maximizing Cut Volume During Mapping

From the discussion in the preceding section, for each node $t$ in the input
network $N$, the Flow-Map algorithm computes a minimum-height K-feasible
cut $(X, \overline{X})$ in $N_t$, and the nodes in $\overline{X}$ will be packed into the same K-LUT with $t$
if a K-LUT is generated to implement $t$. In general, a minimum-height K-feasible
cut is not unique. Intuitively, the larger $vol(X, \overline{X}) = |\overline{X}|$ is, the more nodes we
can pack into a K-LUT, and the fewer K-LUTs we use in total. Therefore, our
algorithm aims to maximize the volume of the cut that it finds at each node.

It can be shown that maximizing the volume of a K-feasible cut $(X, \overline{X})$ in $N_t$
is equivalent to maximizing the volume of the corresponding cut $(X'', \overline{X''})$ in $N''_t$
[20]. Therefore, according to the algorithm presented in the previous section, we
want to find a min-cut in $N''_t$ (i.e., a cut $(X'', \overline{X''})$ with the minimum $e(X'', \overline{X''})$)
of the maximum volume. (Since for every signal crossing the cut in $N_t$ we need
to generate the signal using a K-LUT, minimizing the node cut-size in $N_t$ will
also lead to a reduction of the number of K-LUTs. Consequently, we look for a
min-cut with maximum volume in $N''_t$, instead of any cut with $e(X'', \overline{X''}) \le K$.)
First, we can show the following results.

Lemma 4 There is a unique maximum volume min-cut in any network. Moreover,
if $(X, \overline{X})$ is the maximum volume min-cut and $(Y, \overline{Y})$ is another min-cut
different from $(X, \overline{X})$, then $X \subset Y$.

Lemma 5 Let $R_f$ be the residual graph of a maximum flow $f$. Let $X$ be the set
of nodes in $R_f$ reachable from the source $s$, and $\overline{X}$ be the set of the remaining
nodes. Then, $(X, \overline{X})$ is a maximum volume min-cut.

Combining Theorem 1 and these results, we have 

Theorem 3 A maximum volume min-cut in $N''_t$ can be found in $O(Km)$ time,
where $m$ is the number of edges in $N_t$.

Therefore, Flow-Map maximizes the number of gates covered by each K-LUT
by maximizing the volume of each min-cut in $N''_t$. As a result, area
minimization is also achieved in the depth-optimal mapping of Flow-Map.
4.2 Postprocessing Operations for K-LUT
Reduction

After obtaining a K-LUT mapping solution using the Flow-Map algorithm,
we want to further reduce the number of K-LUTs used in the mapping solution
without increasing the depth. In [9], two depth-preserving operations were
developed to minimize the number of K-LUTs in the mapping solutions of
DAG-Map. One is called predecessor packing and the other is called gate
decomposition. Although these operations lead to a substantial reduction in the number of
K-LUTs, they take only local information into consideration during
minimization.




Figure 4. The Flow-Pack operation ($K = 5$).



In this subsection, we introduce a new postprocessing operation called
Flow-Pack. Given a K-LUT $u$ in the mapping solution, it tries to pack a set of
predecessors of $u$ (including $u$), denoted $P_u$, into a single K-LUT. (See Figure 4.)
Clearly, we need to guarantee the condition that $|input(P_u)| \le K$ in order to
carry out the packing. Let $M$ be the current mapping solution and $M_u$ be the
subnetwork of $M$ consisting of K-LUT $u$ and all its predecessors. It is easily
seen that $P_u$ can be packed into a single K-LUT if and only if $(V(M_u) - P_u, P_u)$
forms a K-feasible cut in $M_u$. Moreover, the larger $|P_u|$ is, the more K-LUTs
we remove from the mapping solution $M$. Therefore, we want to find a K-feasible
cut with the maximum volume in $M_u$.

Since no polynomial time optimal algorithm is known for the maximum
volume K-feasible cut problem in a network, we have developed a heuristic
algorithm to solve this problem. We start with a cut of the minimum node-cut size
and maximum volume, which can be computed using the algorithm described
in the previous subsection. Then, we gradually increase the volume of the cut
(and the node-cut size will increase accordingly as well) until the node-cut size
exceeds $K$. The last K-feasible cut in the sequence is recorded as an approximate
solution to the maximum volume K-feasible cut problem. The details of
this algorithm can be found in [20]. The time complexity of the algorithm was
shown to be 0{K^m). 

Based on this approximation algorithm for the maximum volume K-feasible
cut problem, the Flow-Pack operation is implemented as a part of the postprocessing
step of the Flow-Map package. During the postprocessing phase, we first carry
out the matching-based gate-decomposition operation as described in [9]. Then,
we apply the Flow-Pack operation to each K-LUT $u$ in the mapping solution
so that $u$ is collapsed with a maximal subset of its predecessors into a single
K-LUT.

The advantage of the Flow-Pack operation is clear: the Flow-Pack operation
takes the global information about the entire subnetwork $M_u$ into consideration
during the packing process. Therefore, it leads to a more substantial reduction of
the number of K-LUTs than the predecessor-packing operation defined in [9].
Our experimental results show that on average the postprocessing phase reduced
the number of K-LUTs in the mapping solution by 13.0%, and the Flow-Pack
operation alone reduced the number of K-LUTs by 11.6%.

5. Experimental Results 

We have implemented the Flow-Map algorithm and its preprocessing and
postprocessing steps and tested them on a set of MCNC benchmark examples.
We chose $K = 5$ and compared our results with those produced by previous
algorithms, including Chortle-d [10], DAG-Map [9], and the MIS-pga delay optimization
algorithm [11].

Table 1 compares the performance of Flow-Map with Chortle-d and DAG-Map,
using the input networks that were used by Chortle-d [10]. Overall, the
solutions of Chortle-d used 50.4% more 5-LUTs and had 4.8% larger network
depth; the solutions of DAG-Map used 8.6% more 5-LUTs and had 2.4% larger
network depth. Flow-Map always produces the mapping solution of the smallest
depth, and in most cases uses fewer 5-LUTs.
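The percentages quoted above follow from the totals of Table 1, taking Flow-Map as the reference. The short check below uses only those totals; the dictionary layout and the helper name `overhead_vs_flowmap` are illustrative.

```python
# (total 5-LUTs, total depth) from the "total" row of Table 1.
totals = {
    "Chortle-d": (4906, 87),
    "DAG-Map": (3543, 85),
    "Flow-Map": (3261, 83),
}


def overhead_vs_flowmap(name):
    """Percentage by which `name` exceeds Flow-Map in LUT count and in depth."""
    luts, depth = totals[name]
    ref_luts, ref_depth = totals["Flow-Map"]
    return (round(100 * (luts / ref_luts - 1), 1),
            round(100 * (depth / ref_depth - 1), 1))


for name in ("Chortle-d", "DAG-Map"):
    lut_pct, depth_pct = overhead_vs_flowmap(name)
    print(f"{name}: +{lut_pct}% LUTs, +{depth_pct}% depth")
```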

We also compared Flow-Map with MIS-pga(delay) in Table 2. The results of
MIS-pga(delay) are cited from [11] (since we are unable to run their program
directly). We obtain the results of Flow-Map by first synthesizing the original
benchmarks using a standard MIS optimization script (used by Chortle-crf [3]
and DAG-Map [9]) for technology-independent optimization, then applying the
Flow-Map algorithm for technology mapping. Since MIS-pga(delay) combines
logic synthesis and technology mapping, in several cases it produced mapping
Technology Mapping for 5-LUT FPGAs (I)

          Chortle-d         DAG-Map           Flow-Map
          LUTs     depth    LUTs     depth    LUTs     depth
5xp1      26       3        24       3        25       3
9sym      63       5        61       5        61       5
9symml    59       5        58       5        58       5
C499      382      6        207      5        154      5
C880      329      8        243      8        232      8
alu2      227      9        169      8        162      8
alu4      500      10       305      10       268      10
apex6     308      4        266      4        257      4
apex7     108      4        91       4        89       4
count     91       4        81       4        76       3
des       2086     6        1433     6        1308     5
duke2     241      4        192      4        187      4
misex1    19       2        15       2        15       2
rd84      61       4        43       4        43       4
rot       326      6        292      6        268      6
vg2       55       4        46       4        45       4
z4ml      25       3        17       3        13       3
total     4906     87       3543     85       3261     83
cmprsn    +50.4%   +4.8%    +8.6%    +2.4%    1        1

Table 1. Comparison with Chortle-d and DAG-Map.



solutions of smaller depth than those of Flow-Map. However, on average
MIS-pga(delay) still used 9.8% more 5-LUTs and had 7.1% larger depth.

Our experiments were carried out on a Sun SPARC IPC workstation (14.8
MIPS). For each benchmark example, the Flow-Map package took less
than a minute of CPU time (a few seconds in most cases) to generate a mapping
solution. Therefore, it is much faster than Boolean optimization based
algorithms in general.

6. Future Work 

Currently, we are extending the Flow-Map algorithm to handle more complex 
delay models (such as the nominal delay model [22], which considers both the 
level and the number of fanouts of each node in delay computation). We are also 
studying the trade-off between delay and area in technology mapping for FPGA 
designs. 
Table 2. Comparison with MIS-pga (delay optimization) algorithm.
Abstract

A problem in technology mapping is that quality of the final implementation depends signifi- 
cantly on the initially provided circuit structure. To resolve this problem, conventional techniques 
iteratively but separately apply technology independent transformations and technology mapping. 

In this paper, we propose a procedure which performs logic decomposition and technology 
mapping simultaneously. We show that the procedure effectively explores all possible algebraic 
decompositions. It finds an optimal tree implementation over all the circuit structures examined, 
while the run time is typically logarithmic in the number of decompositions. 



1. Introduction 

Technology mapping is usually applied on a particular structure (decomposi- 
tion) of logic expressions. Therefore, quality of the final implementation could 
severely depend upon the structure initially provided to the technology mapper. 

The most common approach for generating the initial structure of logic
expressions for technology mapping consists of two phases [2]. In phase (1),
logic expressions are optimized in a technology-independent manner. The
result is represented by a graph, which is sometimes called a boolean network [2].
Then in phase (2), the resulting expressions are converted into a special type of
boolean network, where nodes represent particular functions. (In this paper, we
assume that the resulting network is represented by two-input ANDs (AND2s)
and inverters. We call such a network an AND2/INV network.) Technology
mapping is the third and last phase of synthesis, where the AND2/INV network is
mapped onto a set of library gates.

Synthesis techniques based on these three phases have an obvious drawback:
the optimizations applied in each of the phases are disconnected. Although
decisions made in the first and the second phases critically affect the final results, it
is not clear in these phases how the resulting logic expressions will actually be
implemented. On the other hand, in technology mapping, real data on constraints
and library gates are available, but the degree of freedom for the final
implementation is limited by the decisions made in the preceding phases. This drawback
is fatal when one needs to handle tight and complicated constraints.

A variety of techniques have been proposed in the literature to resolve this
problem [1, 6, 15, 5, 11, 16, 18, 13], most of which iteratively apply the three
phases above. Even though these techniques yield some improvements, the
effect is limited because the disconnection between phases (1) and (2) and technology
mapping is not fully resolved. Data from the initial mapping are quickly
invalidated by operations of phases (1) and (2), and cease to be an effective guide.
Moreover, without information about library gates, one can foresee little about
the final implementation, and thus transformations are made somewhat blindly.

The central problem is that technology-independent transformations and
technology mapping do not cooperate. This is the problem addressed in this paper.
Among technology-independent transformations, our focus is on algebraic logic
decomposition [3], a key operation for changing circuit structures. We asked
ourselves if it is possible to apply algebraic logic decomposition and
technology mapping simultaneously. The answer is yes, and we present a procedure
which is as effective as applying technology mapping over all possible
algebraic decompositions exhaustively. The proposed procedure compactly encodes
a set of decompositions (structures) into a single graph, and applies a
technology mapping algorithm on it. Meanwhile, it dynamically modifies the
set of decompositions encoded in the graph by introducing new decompositions
while deleting others based on the actual cost function used in technology
mapping. It is guaranteed that the procedure finds an optimal solution among all the
decompositions examined.

The graph for representing a set of decompositions is called a mapping graph.
It is obtained as the union of AND2/INV networks, where nodes of the networks
are merged as much as possible. Each AND2/INV network corresponds to a
single decomposition, which we call an AND2/INV decomposition.
State-of-the-art technology mapping techniques can be naturally extended for mapping
graphs, retaining linear run time and optimal results.

A set of AND2/INV decompositions is generated through iterative local
transformations defined on a mapping graph. We define three such
transformations: associative transformation, distributive transformation, and inverter
transformation. The associative transformation is based on the associative law, i.e.
two AND2s representing a(bc) are transformed into two AND2s for (ab)c and
vice versa. Similarly, the distributive transformation is based on the distributive
law, and transforms ab + ac to a(b + c). The inverter transformation adds or
removes two consecutive inverters between two consecutive nodes in a mapping
graph. These transformations are the vehicle to introduce new decompositions.
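
As an illustration of how local rewrites generate new decompositions, the following sketch computes the closure of a single AND2 tree under the associative transformation. The representation (an AND2 as an unordered pair of fanins, a leaf as a string) and the function names are assumptions made for this example, not the data structures of the procedure itself.

```python
def rotations(tree):
    """Yield every tree obtained by one associative rewrite a(bc) -> (ab)c
    anywhere in `tree`. An AND2 is a frozenset of its two fanins, so the
    order of the inputs does not matter; a leaf is a string."""
    if isinstance(tree, str):
        return
    left, right = tuple(tree)
    for x, inner in ((left, right), (right, left)):
        if not isinstance(inner, str):
            y, z = tuple(inner)
            yield frozenset([frozenset([x, y]), z])
            yield frozenset([frozenset([x, z]), y])
        # Rewrites inside one fanin leave the other fanin untouched.
        for new_inner in rotations(inner):
            yield frozenset([x, new_inner])


def closure(start):
    """All trees reachable from `start` by repeated associative rewrites."""
    seen, frontier = {start}, [start]
    while frontier:
        tree = frontier.pop()
        for nxt in rotations(tree):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


abc = frozenset(["a", frozenset(["b", "c"])])
abcd = frozenset(["a", frozenset(["b", frozenset(["c", "d"])])])
print(len(closure(abc)), len(closure(abcd)))
```

Starting from a(bc), the closure contains the three trees a(bc), (ab)c and (ac)b; for four inputs it contains 15 trees, which suggests why an explicit enumeration quickly becomes impractical and a compact encoding is needed.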

In this paper, we examine two sets of AND2/INV decompositions. The first is
the closure of the set of AND2/INV decompositions of a given boolean network
η under the associative and inverter transformations, which we denote by Λ_η.
We show that phase (2) of the conventional logic synthesis approach, where a
boolean network is converted to an AND2/INV decomposition, is completely
subsumed by Λ_η, i.e. every AND2/INV decomposition of η is contained in
Λ_η. Therefore, if Λ_η is encoded in a mapping graph and technology mapping is
applied, an optimal result over all possible AND2/INV decompositions of η is
obtained. This technology mapping procedure is referred to as the Λ-mapping
procedure.

The second set of AND2/INV decompositions we examine is even larger. It
is the closure of Λ_η under the distributive transformation, which we denote by
Δ_η. We show that Δ_η subsumes phase (2) as well as a key part of phase (1),
i.e. algebraic logic decomposition. In other words, no matter how algebraic
logic decomposition is applied on a given boolean network η in phase (1), and
no matter how the result is converted to an AND2/INV decomposition in phase
(2), the resulting decomposition is contained in Δ_η. Therefore, the technology
mapper finds an optimal result over all possible AND2/INV decompositions
given through all possible algebraic decompositions. We call this procedure
Δ-mapping. Δ-mapping dynamically applies the distributive transformation during
technology mapping, so that new decompositions are introduced on the fly. It
also deletes some decompositions, but does so only when it is known that at least
equally good implementations can be obtained without them, and thus the quality
of the final result is not affected. In this way, it controls the size of the set of
encoded decompositions, and effectively explores the entire Δ_η.

Both the Λ-mapping and Δ-mapping procedures have been implemented. Extensions
were made to handle sequential circuits, so that logic decompositions across
latches and retiming possibilities are also taken into account. They are being
used for commercial design projects, where Λ-mapping is first used, and then
Δ-mapping is applied on timing-critical regions to further speed up the final
implementations. We show that both procedures are feasible in terms of complexity
for most practical examples, and demonstrate their effectiveness on benchmark
examples.

2. Preliminaries 

We adopt the notion of boolean network and associated terminology from [2].
A variable y_n is associated with each primary input or internal node n. A
sum-of-products expression f_n is associated with each internal node n.

An internal node n in a boolean network can be replaced by a logically
equivalent collection of two-input AND nodes (AND2s) and inverters. Working from
the sum-of-products f_n, each product term is represented by a tree of AND2s.
The sum is represented by a tree of AND2s with an inverter at each leaf and at
the root. The result is called an AND2/INV decomposition of n. An AND2/INV
decomposition of a boolean network is formed by replacing each internal node
by an AND2/INV decomposition of that node.
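
The construction above can be sketched in a few lines. This is a minimal illustration under our own assumptions: the tuple encoding of nodes, the `!` prefix for complemented literals, and the choice of a left-leaning AND2 tree are all hypothetical, and any other tree shape would be an equally valid AND2/INV decomposition.

```python
from itertools import product

# A string is a variable, ("AND", l, r) is an AND2, ("INV", x) is an inverter.

def and_tree(operands):
    """Chain the operands into a left-leaning tree of AND2s."""
    tree = operands[0]
    for operand in operands[1:]:
        tree = ("AND", tree, operand)
    return tree

def decompose_sop(sop):
    """sop is a list of product terms; each term is a list of literals,
    written "a" or "!a" (complemented)."""
    def literal(lit):
        return ("INV", lit[1:]) if lit.startswith("!") else lit
    products = [and_tree([literal(l) for l in term]) for term in sop]
    if len(products) == 1:
        return products[0]
    # The sum is a tree of AND2s with an inverter at each leaf and at the root.
    return ("INV", and_tree([("INV", p) for p in products]))

def evaluate(tree, env):
    if isinstance(tree, str):
        return env[tree]
    if tree[0] == "INV":
        return not evaluate(tree[1], env)
    return evaluate(tree[1], env) and evaluate(tree[2], env)

net = decompose_sop([["a", "b"], ["!a", "c"]])   # f = ab + a'c
ok = all(
    evaluate(net, dict(zip("abc", bits)))
    == ((bits[0] and bits[1]) or (not bits[0] and bits[2]))
    for bits in product([False, True], repeat=3)
)
print(net, ok)
```

The final check compares the decomposition against the original sum-of-products on all eight input assignments.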

A boolean network in which every internal node is either an AND2 or an inverter
is called an AND2/INV network. An AND2/INV decomposition of a boolean
network is always an AND2/INV network. Since our technology mapping
procedures are tree-based, whether internal nodes in an AND2/INV network are
shared or duplicated is irrelevant in this paper. Therefore, it is assumed that all
AND2/INV networks are converted by duplication to a canonical form in which
every internal node has exactly one fanout.

Figure 1. The lower left diagram is a mapping graph and the upper right diagram is an
AND2/INV network encoded within. The path of the depth-first search which generated the
AND2/INV network is highlighted on the left.

A library is a set of gates, each accompanied by area, delay, and loading 
characteristics and by a single-output boolean network representing its logic 
function. A boolean network in which every internal node is associated with a 
logically equivalent library gate is called a mapped network. 

3. Mapping on a Set of Decompositions 

In this section we introduce two basic tools and use them to build a mapping
procedure. The first subsection defines the mapping graph data structure, which
encodes a set of AND2/INV networks. Next, we present the graph-mapping
algorithm which effectively applies tree-mapping [9] to every AND2/INV
network encoded in a mapping graph. Graph-mapping can be applied to any
mapping graph. In the third subsection, we use these two tools to build Λ-mapping,
a two-step procedure consisting of one particular way of generating a mapping
graph from a boolean network followed by a call to graph-mapping.

3.1 Mapping Graph Data Structure 

Conceptually, a mapping graph is an AND2/INV network with four modifi- 
cations. An example is given in the lower-left portion of Figure 1. 

First, a new type of node is introduced. A choice node b is drawn as an
OR gate marked with an X. A choice node has an associated variable y_b, but
no associated sum-of-products. All fanins of a choice node must be logically
equivalent. Intuitively, the technology mapper will first view the choice node as
a wire connecting the first fanin of the choice node to its output, then as a wire
connecting the second fanin to the output, etc. Therefore, several AND2/INV
representations of the same boolean function can be supplied to the mapper by
making the outputs of all the representations into fanins of a common choice
node.

Figure 2. A ugate is a subgraph of a mapping graph. There are three types: (a) primary
input (b) primary output (c) AND. For primary input and AND ugates, the top half of each is
the positive half and the bottom half is the negative half. Likewise, the top output is the positive
output and the bottom is the negative output. The set of AND2s in the positive half of ugate u is
denoted H_u^1 and the set of AND2s in the negative half is denoted H_u^0. Every AND ugate contains
at least one AND2, but otherwise there may be any number in each half.

Second, a mapping graph may contain directed cycles. A directed cycle
specifies that a function f may be expressed in terms of g (creating a directed path
from a node computing g to a node computing f) and also vice versa (completing
the directed cycle). The technology mapper will consider both implementations.
A typical example of a directed cycle, consisting of two inverters and two choice
nodes, appears in several places in Figure 1. This cycle structure encodes the
notion of an arbitrary length inverter chain. Cycles are permitted for theoretical
and implementation convenience.

Third, a mapping graph can be partitioned into disjoint subgraphs, such that 
each subgraph is a ugate. This term is defined in Figure 2. This restriction 
imposes a degree of regularity on the structure of a mapping graph, without 
sacrificing any expressive power. This regularity greatly simplifies implementa- 
tion. A good mental picture of a mapping graph is a directed graph containing 
interconnected, but isolated clumps of nodes, where each clump is a ugate. 

Fourth, a mapping graph must be reduced: there cannot exist two choice
nodes with logically equivalent outputs. This restriction can be met by merging
ugates. If ugates u and v have logically equivalent choice nodes, then copies of
the positive and negative AND2s of u are added to the equivalent halves of v. If
two AND2s of v now have identical inputs, one is discarded. All fanouts of the
positive and negative outputs of u are connected to the equivalent outputs of v.
The nodes of u are deleted. This restriction not only shrinks the graph, but also
provides greater freedom to the mapper by collecting together all known ways
of expressing a function with AND2s and inverters. Intuitively, wherever one
such expression was required, all are now available.

An AND2/INV network encoded by a mapping graph is defined as follows. 
Consider a depth-first search rooted at a primary output of the mapping graph 
with primary inputs at the leaves. At an AND2 node, the search proceeds to 
both fanins. At a choice node, the search proceeds to only one fanin. The 
nodes and arcs traversed in the search form a tree, assuming nodes and arcs 
visited more than once are duplicated. Replace each choice node in the tree by a 
wire connecting the output of the choice node to the single fanin explored by the 
search. See Figure 1. Each complete AND2/INV network encoded in a mapping 
graph consists of one such tree for each primary output. The set of all complete 
AND2/INV networks obtainable in this way is the set of AND2/INV networks 
encoded by a mapping graph. 
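
The definition above can be made concrete with a small sketch that enumerates the trees encoded below one node. The dictionary encoding of the graph and the name `encoded_trees` are assumptions for illustration only, and the sketch handles acyclic graphs only, whereas a real mapping graph may contain cycles (for instance the inverter-chain cycle described earlier).

```python
from itertools import product

# Hypothetical encoding of an acyclic mapping graph:
#   ("PI", name)            primary input
#   ("AND", x, y)           AND2 with fanin node ids x and y
#   ("INV", x)              inverter
#   ("CHOICE", [x, y, ...]) choice node with logically equivalent fanins
graph = {
    "a": ("PI", "a"), "b": ("PI", "b"), "c": ("PI", "c"),
    "ab": ("AND", "a", "b"), "bc": ("AND", "b", "c"), "ac": ("AND", "a", "c"),
    "ab_c": ("AND", "ab", "c"),
    "a_bc": ("AND", "a", "bc"),
    "ac_b": ("AND", "ac", "b"),
    "y": ("CHOICE", ["ab_c", "a_bc", "ac_b"]),
}

def encoded_trees(node):
    """Yield every tree obtained when each choice node reached from `node`
    is replaced by a wire to exactly one of its fanins."""
    kind, *args = graph[node]
    if kind == "PI":
        yield args[0]
    elif kind == "INV":
        for tree in encoded_trees(args[0]):
            yield ("INV", tree)
    elif kind == "AND":
        # At an AND2 node the search proceeds to both fanins.
        lefts = list(encoded_trees(args[0]))
        rights = list(encoded_trees(args[1]))
        for left, right in product(lefts, rights):
            yield ("AND", left, right)
    else:
        # At a choice node the search proceeds to only one fanin.
        for fanin in args[0]:
            yield from encoded_trees(fanin)

trees = list(encoded_trees("y"))
print(len(trees))
```

For this toy graph, which encodes y = abc, the three encoded trees are (ab)c, a(bc) and (ac)b.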

3.2 Graph-Mapping 

This section describes the graph-mapping procedure, an extension of
tree-mapping [9]. The algorithm transforms a mapping graph μ into a mapped
network in which primary outputs are driven by disjoint trees of gates with primary
inputs at the leaves. In practice heuristic methods for sharing logic between trees
are essential, but in the present theoretical discussion we assume no sharing is
used.

3.2.1 Preliminaries. This section reviews three notions used in the 
graph-mapping algorithm: matching, mapping, and cost. 

For a given choice node and library gate, matching identifies all single-output
subgraphs of μ rooted at the choice node which are logically equivalent to the
library gate. Since numerous methods for matching are available [7], we only
formalize the problem.

An AND2/INV decomposition of the boolean network for a library gate is
called a pattern. A match φ of a pattern at an AND2 or inverter node a in
mapping graph μ is a bijective function such that: (1) φ projects each internal
node in the pattern to an internal node in μ. (2) φ projects the fanin of the
pattern's primary output to a. (3) Nodes b and φ(b) are either both AND2s or
else both inverters. (4) For nodes b, c in the pattern there is a directed arc from b
to c if and only if there is a directed path from φ(b) to φ(c) crossing only choice
nodes. (Intuitively, an arc in the pattern may reach across a choice node in the
mapping graph.) Every match at internal node a is also considered a match at
the choice node which is the fanout of a. Let C be the image of φ. The nodes in
C and choice nodes between nodes in C are said to be covered by the match φ.
A fanin of a node in C which is not covered by the match is a fanin of the match.
From this definition, every fanin of a match is a choice node.

A mapping at a choice node c in μ is a recursively-defined set of matches.
Each match corresponds to a library gate and so a mapping uniquely defines a
tree of library gates with primary input nodes at the leaves. Generally, a mapping
consists of a match φ at c and the union of a mapping at each fanin of φ. In the
special case when c is the fanout of a primary input node, either the mapping is
the empty set or else the recursion continues as before. The latter possibility
addresses the case of a primary input driving a chain of inverters and buffers.

Each such mapping has a cost. This cost may measure the total area of the
tree of gates, the arrival time at the root of the tree, etc. Generally, any
notion of cost used in tree-mapping [4, 17] is acceptable. However, to guarantee
that graph-mapping terminates, the library and cost notion must be designed so
that an optimal implementation exists; more precisely, there can exist no
infinite sequence of logically equivalent trees of gates with increasing size and
non-increasing cost. At each choice node c in μ, the graph-mapping procedure
maintains a value cost(c). This is the cost of the least-cost mapping at c yet
found.



3.2.2 The Procedure. This section describes graph-mapping as a 
four-step procedure. The first two steps are initializations. The third and fourth 
steps are analogous to the forward and backward passes of tree-mapping, but 
with a generalized definition of “match”. 

First, for each choice node c with a fanin which is a primary input, cost(c) is
assigned an externally-supplied initial value. For example, if cost measures the
arrival time at the root of a tree of gates, then this initial value is the input arrival
time. For every other choice node c in μ, cost(c) is set to infinity.

Second, each ugate u is assigned a number, label(u), such that the following
three conditions hold. If there is only a directed path from a node in u to a node
in v, then label(u) < label(v). If there is also a directed path from a node in v to
a node in u, then label(u) = label(v). If there is no directed path between a node
in u and a node in v, then label(u) ≠ label(v). This requires a straightforward
graph algorithm which assigns increasing labels from inputs to outputs and the
same label to all ugates in a cycle.
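
One way to satisfy the three labeling conditions is to give every strongly connected group of ugates its own label, in an order compatible with the arcs. The sketch below does this with plain reachability sets; it is quadratic and meant only to make the conditions concrete (a linear-time strongly-connected-components algorithm would be the practical choice). The successor-map input and all names are assumptions for this example.

```python
def reachable(succ, start):
    """Nodes reachable from `start` by following at least one arc."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in succ.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def label_ugates(succ):
    """succ maps each ugate to the ugates it drives."""
    nodes = set(succ) | {v for vs in succ.values() for v in vs}
    reach = {u: reachable(succ, u) | {u} for u in nodes}
    # Ugates that reach each other lie on a common cycle and share a label.
    component = {u: frozenset(v for v in reach[u] if u in reach[v]) for u in nodes}
    # The number of ugates that reach v (including v) strictly grows along
    # arcs between different components, so it yields a valid order.
    ancestors = {v: sum(1 for u in nodes if v in reach[u]) for v in nodes}
    ordered = sorted(set(component.values()),
                     key=lambda comp: (ancestors[min(comp)], min(comp)))
    rank = {comp: i for i, comp in enumerate(ordered)}
    return {u: rank[component[u]] for u in nodes}

succ = {"in1": ["g1"], "in2": ["g1"], "g1": ["g2"],
        "g2": ["g3", "out"], "g3": ["g2"]}
labels = label_ugates(succ)
print(labels)
```

In this toy graph `g2` and `g3` form a cycle and receive the same label, while the unrelated inputs `in1` and `in2` receive different labels.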

Third, a forward pass is made through μ. Ugates are visited in order of
increasing label and a function Match is called on each visited ugate. This
function considers mappings at each choice node c in the ugate and updates cost(c)
whenever a new, lower-cost mapping is found. CostFunc is analogous to the
cost function used in tree-mapping. If there is a set of ugates all with the same
label, then Match is iteratively called on every ugate in the set until there is a
complete iteration during which no cost value changes at any choice node in any
ugate in the cycle.

procedure Match(u)
    do
        for each choice node c in u
            for each match φ of a library gate g at c
                let f_1 ... f_k be the fanins of match φ
                let cost(c) = MIN(cost(c),
                                  CostFunc(g, cost(f_1), ..., cost(f_k)))
    while the cost at some node changed during this iteration

Figure 3. Match function

Fourth, a second pass is made through μ by working backward recursively
from each primary output. The purpose is to construct a mapped network
reflecting the cost values computed in step three. The method is analogous to that
used in tree-mapping.
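
The cost update of the forward pass can be sketched as a fixed-point iteration. In this illustration the cost is an arrival time, CostFunc is taken to be the gate delay plus the latest fanin arrival, and the matches are supplied as a precomputed table; all three are assumptions made for the example, since the procedure itself leaves the matching method and cost function open.

```python
import math

def cost_func(gate_delay, fanin_costs):
    """Assumed cost function: arrival time at the gate output."""
    return gate_delay + max(fanin_costs)

def match_ugate(choice_nodes, matches, cost):
    """Repeat the update until no cost changes in a complete iteration."""
    changed = True
    while changed:
        changed = False
        for c in choice_nodes:
            for gate_delay, fanins in matches[c]:
                candidate = cost_func(gate_delay, [cost[f] for f in fanins])
                if candidate < cost[c]:
                    cost[c] = candidate
                    changed = True
    return cost

# Step one: input arrival times are given, every other cost is infinite.
cost = {"a": 0, "b": 2, "p": math.inf, "pn": math.inf}
# Hypothetical matches: (gate delay, fanin choice nodes of the match).
matches = {
    "p":  [(15, ["a", "b"]), (5, ["pn"])],   # AND2(a, b) or INV(pn)
    "pn": [(8, ["a", "b"]), (5, ["p"])],     # NAND2(a, b) or INV(p)
}
match_ugate(["p", "pn"], matches, cost)
print(cost["p"], cost["pn"])
```

The two choice nodes `p` and `pn` refer to each other through inverters, which is the cyclic situation that makes the repeated iteration necessary: the first pass finds p through the AND2, and the second pass improves it through the NAND2 followed by an inverter.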

3.2.3 Effectiveness. The graph-mapping algorithm is as effective as 
applying tree-mapping to every encoded AND2/INV network, but generally far 
more efficient. 



Theorem 1 Suppose graph-mapping is applied to a mapping graph μ, and
tree-mapping is applied to an AND2/INV network encoded in μ. For each primary
output, the implementation produced by graph-mapping has cost less than or
equal to the implementation produced by tree-mapping with the same CostFunc.



The algorithm usually runs in linear time in the number of nodes in the map- 
ping graph (steps one, two, and three) plus the size of the mapped network (step 
four). The while loop in Match typically runs a small number of times. Loops 
in step three are uncommon in practice. 

The efficiency of graph-mapping compared to exhaustive application of tree- 
mapping is therefore closely related to the compactness with which a mapping 
graph encodes a set of AND2/INV networks. In fact, the mapping graph repre- 
sentation is often extremely effective. For example, there are 2.22 x 10^^ distinct 
AND2/INV decompositions of the MCNC91 cht benchmark. However, this en- 
tire set can be encoded in a mapping graph with 400 ugates containing a total of 
599 AND2 nodes using the procedure described in the next section.

Figure 4. Two transformations of an AND2/INV network: (a) associative transformation (b)
inverter transformation.



3.3 Λ-Mapping

For a boolean network η, the set Λ_η is defined as the closure of the set of
AND2/INV decompositions of η under the associative and inverter
transformations. These transformations are depicted in Figure 4. This section defines
Λ-mapping, a mapping procedure which explores the entire space Λ_η. The
following three-part procedure constructs a mapping graph μ encoding Λ_η. Λ-mapping
is completed by applying graph-mapping to μ.

First, η is used to produce a new boolean network η̂. Each node n of η
generates a set of multi-input ANDs and inverters in η̂ representing the
sum-of-products f_n. Each AND and inverter is a single node in η̂. All pairs of
inverters in series are deleted from η̂. If one AND drives another in η̂, the two
are combined to form a single AND with more inputs.

Second, each node of η̂ is translated to a set of nodes in μ. The translation
works from inputs to outputs. A particular ugate output, denoted root(n), is
associated with each node n in the network η̂ after it is processed. Each primary
input n in η̂ generates a primary input ugate u, where root(n) is the positive
output of u. An inverter n with fanin m does not generate a ugate. However,
if root(m) is an output of the ugate u, then root(n) is the other output of ugate
u. A primary output n with fanin m generates a primary output ugate driven by
root(m). The most complex case is a multi-input AND node n. For each subset
of two or more inputs, a ugate is created whose positive half contains
an AND2 for each partition of the subset into two non-empty, disjoint parts.
root(n) is the positive output of the ugate associated with the full set of inputs.
See Figure 5 for a three-input example.
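
To give a feel for the size of this construction, the following sketch enumerates the ugates and AND2s generated for a single multi-input AND node, before reduction. The function name and the tuple representation are assumptions for illustration; the ugates for the individual inputs themselves are not counted.

```python
from itertools import combinations

def and_ugates(inputs):
    """Map each subset of two or more inputs to its AND2s, where an AND2
    is an unordered split of the subset into two non-empty, disjoint parts."""
    ugates = {}
    for size in range(2, len(inputs) + 1):
        for subset in combinations(inputs, size):
            first, rest = subset[0], subset[1:]
            and2s = []
            # Keep `first` on the left so each unordered split appears once.
            for k in range(len(rest)):
                for extra in combinations(rest, k):
                    left = (first,) + extra
                    right = tuple(x for x in rest if x not in extra)
                    and2s.append((left, right))
            ugates[subset] = and2s
    return ugates

for n in (3, 10):
    ugates = and_ugates("abcdefghij"[:n])
    print(n, len(ugates), sum(len(v) for v in ugates.values()))
```

A subset of size s contributes 2^(s-1) - 1 AND2s, so an n-input AND yields 2^n - n - 1 ugates and (3^n + 1)/2 - 2^n AND2s in total: 4 ugates and 6 AND2s for three inputs, and 1013 ugates and 28501 AND2s for ten inputs.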

Finally, μ is reduced as described in Section 3.1. 

Theorem 2 For a general boolean network η, every element of Λ_η is encoded
in the mapping graph μ generated by the above construction.

As a corollary, the result of Λ-mapping is as good as or better than that obtained
by applying tree-mapping to every possible AND2/INV decomposition of the
original network. All decisions taken blindly in phase (2) of the conventional
synthesis approach, AND2/INV decomposition, are now taken optimally during
technology mapping.

Figure 5. A portion of a mapping graph which encodes all AND2/INV decompositions of y
= abc is shown. Ugates are circled and the associated subset of inputs is indicated. The ugates
producing signals a, b, and c are not shown.

Λ-mapping often explores a space larger than Λ_η. This is because extra
AND2/INV decompositions may be generated during the reduction operation
in step three of the mapping graph construction.

In practice, it is usually feasible to encode Λ_η in a mapping graph provided no
AND in the intermediate network has more than about 10 inputs. This means
that Λ-mapping can be applied to most practical examples. Of course, if there is
a larger AND in η̂, it is still possible to encode numerous decompositions of the
node.



4. Dynamic Logic Decomposition 

In this section, we extend Λ-mapping, so that the set of AND2/INV networks
is dynamically modified during technology mapping. This process is done by
dynamically performing logic decomposition on a mapping graph.

4.1 Δ-Mapping

We first define a new transformation on an AND2/INV network, the vehicle
for dynamic logic decomposition. The transformation is based on the
distributive law, and is referred to as the distributive transformation. The
distributive transformation replaces xy + xz by x(y + z), which is illustrated in
Figure 6, where Figure 6-(a) corresponds to xy + xz and Figure 6-(b) corresponds
to x(y + z). We call the structure shown in the rectangle of Figure 6-(a) the
distributed pattern, or D-pattern for short. Similarly, the structure shown in
Figure 6-(b) is referred to as the factored pattern, or F-pattern for short. Note that
the functionality of a D-pattern is the complement of that of the corresponding
F-pattern.

Figure 6. Distributive Transformation: (a) distributed pattern (b) factored pattern.
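
The complement relation can be checked by exhaustive simulation. The sketch below assumes that the D-pattern is the AND2 of the two inverted products, i.e. the decomposition of xy + xz without the inverter at its root, and that the F-pattern is x ANDed with an AND2/INV realization of y + z; this reading of Figure 6 is an assumption, since the figure itself is not reproduced here.

```python
from itertools import product

def and2(p, q):
    return p and q

def inv(p):
    return not p

def d_pattern(x, y, z):
    # Assumed D-pattern: xy + xz built from AND2s, before the root inverter.
    return and2(inv(and2(x, y)), inv(and2(x, z)))

def f_pattern(x, y, z):
    # Assumed F-pattern: x(y + z), with y + z built as INV(AND2(INV y, INV z)).
    return and2(x, inv(and2(inv(y), inv(z))))

complementary = all(
    d_pattern(*bits) != f_pattern(*bits)
    for bits in product([False, True], repeat=3)
)
print(complementary)
```

Under these assumptions the two patterns disagree on every input assignment, which is why a newly built F-pattern belongs in the half of the ugate with the opposite polarity.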

The idea of dynamic decomposition is simple: while traversing a mapping
graph during technology mapping, we identify D-patterns and add the
corresponding F-patterns to the mapping graph. We call this technology mapping
procedure Δ-mapping. We illustrate the basis of the procedure in this section,
and refine it in Sections 4.3 and 4.4.

The Δ-mapping procedure is an iteration of the graph-mapping procedure.
Once a mapping graph is constructed for a given boolean network, we apply
graph-mapping while finding D-patterns and adding F-patterns. Once
graph-mapping is done, we check if there is a solution which meets user-specified
constraints. If so, we terminate Δ-mapping by generating the result. Otherwise, we
apply the graph-mapping procedure again, while identifying more D-patterns,
since new D-patterns might have been introduced when F-patterns were added
in the previous iteration. This process is iterated until either a desired solution
is found or no new D-pattern is found.

The core procedure is Decomp_and_match, illustrated in Figure 7. This
procedure is called in graph-mapping as a substitute for Match, the function defined
in Section 3.2 for finding matches at a ugate. Decomp_and_match is identical
to Match if the ugate u being processed is not of type AND. If the type of
u is AND, Decomp_and_match identifies D-patterns and adds F-patterns
before calling Match. More precisely, for each polarity p ∈ {0, 1}, we look at
each AND2 a of H_u^p, the set of AND2s of u for the polarity p. We check
if a D-pattern matches at a. If this is the case, we construct the corresponding
F-pattern shown in Figure 6-(b), and add it to the mapping graph. Since the
functionality of the resulting F-pattern is the complement of that of the D-pattern
we identified, the root AND2 of the F-pattern in Figure 6-(b) is
added to the set of AND2s of u whose polarity is the opposite of p. Recall
the structure of an AND ugate shown in Figure 2-(c).

procedure Decomp_and_match(u)
    if u is AND ugate
        for each polarity p ∈ {0, 1}
            for each a ∈ H_u^p
                if D-pattern is found at a
                    add F-pattern to the mapping graph
                    /* t is a ugate for the second
                       AND2 a_t of F-pattern. */
                    Match(t)
    Match(u)

Figure 7. Dynamic Logic Decomposition (basic flow)

To add the other AND2 of the F-pattern, denoted by a_t in Figure 6-(b), we
first create a new AND ugate, which consists of a_t in the positive half. The
negative output of the ugate fans out to the root AND2 of the F-pattern. The
resulting mapping graph is then reduced. We invoke Match for the ugate
containing a_t.

Once this process has been applied for both polarities, we find matches for
u and update the cost by calling Match. In the rest of the paper, we often say
that dynamic decomposition is applied for a ugate u or an AND2 a, by which
we mean that Decomp_and_match, or its subprocedure for a single AND2, is
applied for u or a respectively.

4.2 The Space Explored by A-Mapping 

Algebraic decomposition is a three-step operation defined on a node n of a
boolean network: (1) find an algebraic divisor g of f_n, (2) create or find a node
m such that f_m = g, and (3) replace f_n by r + q y_m, where h is a quotient
obtained by algebraically dividing f_n by g, q is a product term of h, and r is the
corresponding remainder. A definition of algebraic division is found in [3].
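
For readers unfamiliar with algebraic division, the following sketch shows the usual weak-division computation of a quotient and remainder. It is a generic textbook-style illustration, not the definition from [3]: expressions are sets of cubes, each cube is a frozenset of single-character literals, and the function name is our own.

```python
def algebraic_divide(f, g):
    """Return (quotient, remainder) such that f = quotient * g + remainder,
    where the product is formed cube by cube."""
    quotient = None
    for g_cube in g:
        # Cubes of f that contain g_cube, with g_cube removed.
        partial = {cube - g_cube for cube in f if g_cube <= cube}
        quotient = partial if quotient is None else quotient & partial
    quotient = quotient or set()
    covered = {q | g_cube for q in quotient for g_cube in g}
    remainder = set(f) - covered
    return quotient, remainder

# f = ab + ac + d, divisor g = b + c
f = {frozenset("ab"), frozenset("ac"), frozenset("d")}
g = {frozenset("b"), frozenset("c")}
h, r = algebraic_divide(f, g)
print(h, r)
```

Here the quotient is a and the remainder is d, so f can be rewritten as a(b + c) + d, which is the kind of restructuring that algebraic decomposition performs on a node.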

For a given boolean network η, let Δ_η denote the closure of Λ_η under the
distributive transformation. Then the following claim holds [10].

Theorem 3 Given a boolean network η, let η′ be a boolean network obtained
from η by successively applying algebraic decomposition. Then every AND2/INV
decomposition of η′ is contained in Δ_η.

This theorem claims that no matter how algebraic decomposition is applied
on η, every AND2/INV decomposition of the resulting boolean network is
contained in Δ_η. Now, let D_η be the set of AND2/INV networks captured by
applying the Δ-mapping procedure on η. Specifically, suppose we apply dynamic
decomposition maximally on ugates of a mapping graph μ, constructed for η as
described in Section 3.3, i.e. whenever a D-pattern is found, the corresponding
F-pattern is added to μ, and this process is repeated until nothing new is found.
D_η is given by the set of AND2/INV networks encoded in the resulting mapping
graph. Since Λ-mapping subsumes Λ_η, D_η subsumes Δ_η. Therefore, Theorem 3
implies that by applying Δ-mapping, one can effectively explore the entire set
of AND2/INV networks obtained from η through algebraic decomposition and
AND2/INV decomposition. By Theorem 1, an optimal result over all such
decompositions is obtained. We emphasize the point that Δ-mapping explores the
space directly on a mapping graph during technology mapping, and there is no
need to go back and forth between the technology-independent phase and the
technology-dependent phase.

One may generalize the algebraic decomposition so that after the third step,
the expression of the node n is further divided by the complement of the divisor
g. Such a generalized algebraic decomposition is also subsumed by Δ-mapping.
Let η̂ denote the boolean network obtained by the generalized algebraic
decomposition with a divisor g. In an AND2/INV decomposition of η̂, the complement
of g is represented as an inverter driven by an AND2/INV decomposition
for g. Since a mapping graph is reduced, i.e. there are no two distinct choice
nodes that are logically equivalent, an AND2/INV network for the complement of g
is also encoded in the mapping graph as an inverter driven by the choice node whose
transitive fanin represents AND2/INV networks for g. Hence, the set of AND2/INV
decompositions of η̂ is contained in D_η. However, it is known that the set is not
necessarily contained in Δ_η in general, and thus D_η can be strictly larger than
Δ_η.



4.3 Factor-Free Ugates 

In practice, the Δ-mapping procedure is used to speed up a timing-critical
region of a mapped network. Since the region is usually small, and in order to
maximize the effectiveness of dynamic decomposition, we initially collapse the
region to a sum-of-products expression. Thus, the procedure is typically applied on a
boolean network with a single node. In this and the next section, we modify
Δ-mapping in order to improve the efficiency of the procedure without sacrificing the
quality of results for this typical case.

Δ-mapping iteratively applies the graph-mapping procedure with dynamic
decomposition. It is inefficient if the distributive transformation is applied many
times on a ugate at which no new D-pattern can be found. In this section, we
discuss a condition under which such an inefficient computation can be avoided.

Intuitively, dynamic decomposition can be skipped for an AND2 a if all
D-patterns have been identified in the transitive fanin of a. An AND2 of a ugate
is said to be factor-free if there is no untransformed D-pattern in its transitive
fanin, where a D-pattern is said to be transformed if the corresponding F-pattern
has been added to the mapping graph, and untransformed otherwise. A ugate is
said to be factor-free if all of its AND2s are factor-free.

The Δ-mapping procedure identifies this property locally. It marks an AND2
a as FF (factor-free) if dynamic decomposition has been applied for a and both
of its fanin ugates are marked as FF. A ugate is marked as FF if all of its AND2s
are marked as FF. Initially, we mark ugates for primary inputs as FF. When
Dynamic-decomp is invoked, we skip the procedure for an AND2 if it is marked
as FF.
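
The marking rules above can be sketched in a few lines of Python. This is only
an illustration under simplifying assumptions: the classes And2 and Ugate, the
field names, and the functions mark_factor_free and needs_dynamic_decomp are
hypothetical and are not taken from the SynFul implementation. The sketch also
ignores inverters and choice nodes and assumes the ugates are visited in
topological order (fanins before fanouts).

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class And2:
    """Hypothetical AND2 record; fanins holds its two fanin ugates."""
    fanins: tuple
    decomposed: bool = False  # dynamic decomposition already applied here
    ff: bool = False          # factor-free flag


@dataclass(eq=False)
class Ugate:
    """Hypothetical ugate record holding its alternative AND2s."""
    and2s: list = field(default_factory=list)
    is_primary_input: bool = False
    ff: bool = False


def mark_factor_free(ugates_in_topological_order):
    """Apply the local FF rules, visiting fanins before fanouts."""
    for u in ugates_in_topological_order:
        if u.is_primary_input:
            u.ff = True  # primary-input ugates start out as FF
            continue
        for a in u.and2s:
            # An AND2 is FF if it was decomposed and both fanin ugates are FF.
            a.ff = a.decomposed and all(f.ff for f in a.fanins)
        # A ugate is FF if all of its AND2s are FF.
        u.ff = bool(u.and2s) and all(a.ff for a in u.and2s)


def needs_dynamic_decomp(a):
    """Dynamic decomposition is skipped for AND2s already marked FF."""
    return not a.ff
```

Because the flags are computed from the immediate fanins only, one pass over the
graph suffices, which matches the local nature of the rule stated in the text.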



4.4 Dynamic Decomposition with Deletion 

So far, Δ-mapping only introduces new AND2/INV networks, and never
deletes existing ones. However, some of them might not lead to good results,
and thus it is not worth keeping them. The Δ-mapping procedure can be
modified to detect such a situation, to some extent, and delete those AND2/INV
networks. This section illustrates the basic idea, while details are found in [10].

The basic step of the deletion process is to delete an AND2 from the ugate
it belongs to. Intuitively, it is worth keeping an AND2 a if there is a
possibility that a is included in a mapping identified at some node of the
mapping graph at which no mapping with at least equally good cost can be found
without a. To identify this condition, we modify Δ-mapping as follows.

First, when Match is applied on a ugate u, we compute a subset A' of the set
A of AND2s of u such that for each choice node c in the transitive fanout of
u, cost(c) remains the same even if only A' is used for finding matches at u.
This is done by using a notion called partial matches [7]. AND2s which are not
in A' are candidates for deletion. The problem of finding A' with the minimum
cardinality is known to be NP-hard [8], and thus we find one using a heuristic.

Consider an AND2 a of u which is not in the set A'. We cannot immediately
conclude that a can be deleted. In fact, there are two cases where a should
be retained. One case is that a new match could be found at a in the future.
The other case is that an untransformed D-pattern containing a could be found
in the future. If neither happens, a can be deleted without affecting the quality
of the result. We detect the first case above by using the flag FF (factor-free)
defined in the previous section. Since the first case does not arise as long as
a is factor-free, we check if a is marked as FF. The second case above is
detected by using another flag called DF (distribution-free). An AND2 a of a
ugate u is said to be distribution-free if, for each AND2 d which references u
as a fanin, d is distribution-free and no untransformed D-pattern matches at d.
If a is distribution-free, a will not be contained in a D-pattern when dynamic
decomposition is applied the next time. Since this is true for all the fanouts of
a, no new D-pattern will be introduced in its transitive fanout in the rest of the
procedure. In the modified Δ-mapping procedure, we traverse a mapping graph
between consecutive iterations of the graph-mapping procedure, and mark an
AND2 a of a ugate u as DF if for each fanout d of u, d is marked as DF and
dynamic decomposition was applied on d in the previous iteration. We delete a
from u if a is not in A' and it is marked as FF and DF. If a is deleted, the fanins
of a might have no fanout. We delete the fanins as well if this is the case.
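
The deletion test described above can be illustrated with a short Python sketch.
Everything named here (And2, Ugate, the fields ff, df and decomposed_last_iter,
and the functions mark_distribution_free, deletable and prune) is a hypothetical
simplification for illustration, not the authors' code. The set kept stands for
the subset A' computed during matching, the ugates are assumed to be visited
from the outputs toward the inputs, and the clean-up of fanins left without
fanout is omitted.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class And2:
    """Hypothetical AND2 record carrying the two flags used for deletion."""
    name: str
    ff: bool = False                    # factor-free flag (Section 4.3)
    df: bool = False                    # distribution-free flag
    decomposed_last_iter: bool = False  # decomposition applied last iteration


@dataclass(eq=False)
class Ugate:
    """Hypothetical ugate; fanouts lists the AND2s that use it as a fanin."""
    and2s: list = field(default_factory=list)
    fanouts: list = field(default_factory=list)


def mark_distribution_free(ugates_outputs_first):
    """Mark AND2s as DF, visiting each ugate after all of its fanouts."""
    for u in ugates_outputs_first:
        # Every fanout d of u must be DF and decomposed in the last iteration.
        ok = all(d.df and d.decomposed_last_iter for d in u.fanouts)
        for a in u.and2s:
            a.df = ok


def deletable(a, kept):
    """a may be deleted if it is outside A' (kept) and marked FF and DF."""
    return a not in kept and a.ff and a.df


def prune(u, kept):
    """Drop every deletable AND2 from ugate u."""
    u.and2s = [a for a in u.and2s if not deletable(a, kept)]
```

A ugate without fanouts marks its AND2s as DF vacuously in this sketch; whether
the original procedure treats primary outputs this way is an assumption.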

It is claimed that the modified Δ-mapping does not affect the quality of results
in practice [10].

Theorem 4 Suppose that a boolean network η consists of a single node and the
associated expression is prime and irredundant. Then the cost of a mapped
network for η obtained by the Δ-mapping procedure with the modifications stated
in Sections 4.3 and 4.4 is no worse than that given by the original Δ-mapping.

5. Experimental Results 

These procedures were implemented in a synthesis system called SynFul. 
SynFul is developed on top of SIS [12], and inherits technology independent 
transformations from SIS. Operations related to technology mapping have been 
newly implemented. The implementation is highly optimized, with extensions 
for handling sequential circuits and generating non-tree implementations [10]. 

We conducted experiments on MCNC91 benchmark examples of combina- 
tional circuits, and compared the results to SIS-1.2. The library used is the one 
modeled for a commercial design project in progress at the authors’ affiliation. 
We specified delay characteristics so that both systems computed the cost in 
exactly the same way to find the fastest possible implementation. 

For SIS-1.2, we used a sequence of operations aimed at timing-oriented
synthesis [13, 17]. A procedure for restructuring a mapped network [13] was also
applied. We consulted SIS developers for detailed usage of these procedures to 
derive the best performance of the system [14]. 

In SynFul, we first applied the Λ-mapping procedure where, for each
example, four different sets of technology independent transformations were first
applied on the initial boolean network. AND2/INV decompositions for each
of the resulting boolean networks were encoded into a single mapping graph p.
The rationale is that a variety of decompositions can be encoded in a mapping
graph in this way. The graph-mapping procedure was then applied on p. Once a
mapped network was obtained, we further applied the Δ-mapping procedure on
a timing-critical region. Such a region was identified using a heuristic suggested
in [13].

The results are shown in Table 1. For SIS-1.2 and Λ-mapping, Delay is the
arrival time of the most critical primary output. The most critical output in the
network generated by Λ-mapping was re-synthesized by Δ-mapping, and the
resulting arrival time is shown under Delay of Δ-mapping. The delay unit is
set to the minimum delay from an input pin to the output pin over all the gates
specified in the library. Area shows the area of the entire network obtained by
each procedure, where the area unit is the smallest area among the library gates.
Time is the CPU time in seconds measured on VAXstation 6000.

             SIS-1.2        SynFul Λ-mapping        SynFul Δ-mapping
Name       Delay    Area   Delay    Area    Time   Delay    Area    Time
cc          12.0   167.7     9.5   192.0    16.5     8.0   193.3     3.3
cm82a       13.2    86.0    10.5    58.3     8.9     7.0    78.7     8.8
pcle        15.0   266.7    12.0   156.7    35.7    10.5   176.7    11.5
pm1         13.5   135.0     9.8   135.0    47.1     9.2   131.3    14.0
x2          17.7   151.0     9.8   109.7    58.0     9.8   109.7    24.1
cm163a      15.5   130.7    12.0   111.3    16.5     9.7   121.0    39.9
comp        32.5   401.3    26.0   502.7   193.4    21.2   454.3    43.5
cu          16.0   145.0    11.8   147.3    27.7     9.8   149.0    50.3
i1          12.7   183.0    11.5   132.3    24.2     9.8   111.0    62.3
cm151a      15.0    77.0    10.5    66.3    14.5     9.0    88.3   103.7
cht          9.2   459.7     9.2   320.7    10.7     8.2   323.7     2.8
ttt2        21.5   530.7    15.3   576.0   112.4    12.7   606.7   134.4
cm162a      12.7   164.7    12.0   111.3    21.4     9.7   122.7   154.8
term1       25.8   612.3    19.5   658.3   314.9    15.7   602.0   266.9

Table 1. Experimental Results (Timing-Driven Technology Mapping)

The results show that SynFul finds faster implementations than SIS-1.2 over
all the examples tried. The results are remarkably better for some examples,
such as x2, ttt2, and comp. Even though Λ-mapping already outperforms
SIS-1.2, further improvements are usually possible by using the Δ-mapping
procedure. Note that Δ-mapping effectively tries all the algebraic decompositions
using the real cost function in technology mapping. Therefore, we consider the
procedure especially effective when the arrival times of input signals vary and
a decomposition for the fastest implementation cannot be easily identified by
technology independent transformations. The CPU time is reasonable.
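
To make "remarkably better" concrete, the relative delay reduction over SIS-1.2
can be computed directly from Table 1. The short Python sketch below does this
for the three examples named above, using the Delay values of SIS-1.2 and of the
first and second SynFul columns (Λ-mapping and Δ-mapping, respectively). The
helper names reduction and summarize are ours and purely illustrative.

```python
# (SIS-1.2 delay, first SynFul delay, second SynFul delay), copied from Table 1.
TABLE1_DELAY = {
    "x2": (17.7, 9.8, 9.8),
    "ttt2": (21.5, 15.3, 12.7),
    "comp": (32.5, 26.0, 21.2),
}


def reduction(baseline, improved):
    """Fractional delay reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


def summarize(table):
    """Map each example to its percentage reductions versus SIS-1.2."""
    return {
        name: (100.0 * reduction(sis, first), 100.0 * reduction(sis, second))
        for name, (sis, first, second) in table.items()
    }


if __name__ == "__main__":
    for name, (first_pct, second_pct) in summarize(TABLE1_DELAY).items():
        print(f"{name}: {first_pct:.1f}% and {second_pct:.1f}% faster than SIS-1.2")
```

For these three circuits the reductions fall roughly between 20% and 45%.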

6. Conclusion 

Conventional approaches distinguish three phases in logic synthesis: (1) tech- 
nology independent processing, (2) AND2/INV decomposition, and (3) technol- 
ogy mapping. Although decisions made in the first two phases critically affect 
results, accurate delay, area, and loading information is not available until the 
third phase. Therefore, these critical decisions must be taken almost arbitrarily, 
producing inferior results. Our approach allows a key part of phase (1) and all 
of phase (2) to be combined with mapping. 

The cornerstones of our approach are the mapping graph data structure and
the graph-mapping algorithm. Built atop these pieces, the Λ-mapping
procedure combines the AND2/INV decomposition phase of synthesis into mapping.
Applying Λ-mapping to a boolean network produces a result as good as can be
obtained by applying tree-mapping to every AND2/INV decomposition. The
Δ-mapping procedure goes further by performing algebraic decomposition
directly on a mapping graph during technology mapping. Applying Δ-mapping to
a boolean network is as effective as applying tree-mapping to every AND2/INV
decomposition of every algebraic decomposition of the network. Our
experiments suggest that Λ-mapping and Δ-mapping improve results substantially
over the conventional approach.

A theoretically attractive procedure, which we have not implemented, would
apply both the distributive and associative transformations during mapping. In
theory, such a procedure would be superior even to Δ-mapping.

Notes 

1. As with ordinary techniques, the generalized technology mapping is also tree-based, and thus opti- 
mality is claimed on tree implementations. A heuristic is applied to generate non-tree implementations. 

2. The present discussion assumes the set of possible costs is totally ordered; like tree-mapping, however, 
graph-mapping extends to partially ordered costs. 
Abstract

When ICCAD began in 1983, we had no robust tools for device modelling, analog circuit syn- 
thesis, electrical timing simulation, transistor-to-logic abstraction, or large-scale custom circuit 
tuning. Today, all these techniques are in common industrial usage, most commercially available. 
The seven papers in this section fundamentally transformed the way in which we model, manip- 
ulate, and solve these difficult tasks today. They are linked by a common thread of “deep circuits 
innovation”. 



1. Introduction 

At first glance, the papers in this section might appear to be rather, well,
eclectic, spanning as they do a range of circuit-related topics from modelling to
synthesis to verification to extraction. However, it would be a profound
disservice to the papers selected in this section to impute that they lack some
deep connection. In this circuits area, the core contributions are all about
creating the essential, novel problem abstraction that transforms a difficult
electrically oriented task into something that now supports efficient synthesis,
or accurate modelling, or robust optimization, or deep verification. The papers
in this section all share this defining trait: they introduce a deep, novel model
of the problem that allows us to make new progress on design automation. Before
these abstractions, efforts existed in each of these areas, but those efforts
lacked rigor, robustness, and scope. After these papers, the way in which we
viewed these difficult tasks was fundamentally transformed.
This section describes seven papers chosen from ICCAD that are linked by
this “deep circuits innovation” metric of excellence. They are, in chronological 
order: 

1 TECAP2: An Interactive Device Characterization and Model Develop- 
ment System by Ebrahim Khalily, Peter H. Decher and Darrell A. Teegar- 
den appeared in ICCAD 1984 [1]. Today, it is a bedrock principle of good 
design that good circuits only result from fabrication technologies with 
well-characterized devices and correspondingly accurate device models. 
Today, a deep infrastructure of model extraction, model fitting, and model
validation tools exists. In 1984, this was a radical idea. TECAP2 was 
among the very first efforts to show how one might build design automa- 
tion tools to help circuit designers with this fundamental task. 

2 TILOS: A Posynomial Programming Approach to Transistor Sizing by
Jack Fishburn and Al Dunlop appeared in ICCAD 1985 [2]. Today,
everyone who works on circuit optimization knows what posynomial means,
and understands that one can create efficient, large-scale nonlinear pro- 
gramming problems that can model how MOS device sizing affects delay. 
This understanding, however, all dates to this single paper with its elegant 
and powerful geometric programming model of the MOS sizing problem, 
and its accompanying efficient solution technique. It is not a stretch to 
argue that TILOS revolutionized the way we treat large digital netlists, 
paving the way for today’s large-scale convex-solver device sizing tools. 

3 Automatic Synthesis of Operational Amplifiers Based on Analytic Circuit 
Models by Han Young Koh, Carlo H. Sequin and Paul R. Gray appeared in 
ICCAD 1987 [3]. 1987 was a banner year in the (relatively short) history 
of analog synthesis. Several important papers appeared that defined the 
first serious attempts at synthesis for analog. The tool described in this 
paper, OPASYN, was one of these pioneering efforts, showing a linked 
set of abstractions for topology selection, sizing, and first-cut layout for 
op-amps. 

4 SPECS2: An Integrated Circuit Timing Simulator by Chandramouli Vis- 
weswariah and Ronald A. Rohrer appeared in ICCAD 1987 [4]. By the 
end of the 1980s, classical SPICE simulation engines were showing their 
age, and were increasingly incapable of scaling to the very large designs 
appearing in this decade. Various piecewise linear or discrete models had 
appeared to attack this problem, but they lacked any rigorous foundation 
and were plagued with accuracy problems. SPECS2 introduced one of 
the very first numerically solid formulations for fast timing simulation of 
large digital circuits. 
5 Analog Circuit Synthesis for Performance in OASYS by Ramesh Harjani,
Rob A. Rutenbar and L. Richard Carley appeared in ICCAD 1988 [5]. 
Analog circuit designers are nothing if not skeptical creatures. Just as the 
very first generation of analog synthesis papers appeared in 1987, the first 
volley of questions came forth from potential users, asking “well, can we 
do real high-performance designs with these tools?” The tool described 
in this paper, OASYS, was among the first to answer this question with a 
solid “yes”. 

6 Extraction of Gate Level Models from Transistor Circuits by Four-Valued 
Symbolic Analysis by Randal E. Bryant appeared in ICCAD 1991 [6]. It 
sounds ever so simple; I have an MOS netlist representing a digital de- 
sign, I would like to extract an equivalent gate-level description so that I 
might simulate this design more quickly. Oh, and by the way, don’t bother 
me about any nasty electrical issues like bidirectional transistors or charge 
sharing or multiple signal strengths. Just do it. The TRANALYZE tool 
introduced in this paper was the first to be able to do this extremely im- 
portant form of extraction in a robust way. Prior efforts all failed on one 
difficult circuit or another. TRANALYZE was the first algorithm to han- 
dle this problem in a robust fashion, with an elegant, rigorous theoretical 
foundation. 

7 Optimization of Custom MOS Circuits by Transistor Sizing by Andrew
R. Conn, Paula K. Coulman, Ruud A. Haring, Gregory L. Morrill, and
Chandu Visweswariah appeared in ICCAD 1996 [7]. Sometimes, there is
simply no substitute for the best design. In applications such as
microprocessors, individual circuit blocks are extensively (and laboriously)
tuned for optimal
power and delay. There is no substitute here for the accuracy of a real
circuit simulator to judge the quality of the final solution. This paper in- 
troduced techniques to couple state-of-the-art gradient optimizers (based 
on powerful trust-region methods) with state-of-the-art circuit simulators, 
for the purpose of efficient, automated local improvement of maximum- 
performance circuit blocks. The resulting tool, JiffyTune, was only the 
first of a series of powerful optimizers built and used with great success 
to optimize custom circuit tuning. 

In the remainder of the paper, we describe each paper in a bit more detail, and 
connect them briefly to other important work in their area. 

2. Fitting Device Models 

The first paper in this section, TECAP2: An Interactive Device Characteriza- 
tion and Model Development System by Ebrahim Khalily, Peter H. Decher and 
Darrell A. Teegarden appeared in ICCAD 1984 [1]. The semiconductor industry 
is moving to nanometer device geometries involving physical phenomena that
are hard to model. For many circuits now operating at high frequencies,
DC device models, albeit accurate ones, are no longer sufficient.
Simultaneously, accurate models and process statistics are critical to the
convergence of circuit simulation. Furthermore, time-to-market pressures on the
overall development cycle place device models on a critical path and push
modelling engineers to sometimes provide models prior to performing any
measurements on silicon [8, 9]. The actual development of device models is a
critical subject, and an excellent reference for MOS device modelling is [10].

More than 15 years ago, the TECAP2 tool introduced a new technology that
has prepared the industry to face today’s challenges. The innovation of
this paper lies in the fact that this modelling tool is device independent and
combines, within a single environment wrapped around a user interface, all the
functions necessary to model a device.

■ Device independent: TECAP2 can use built-in models, yet provides users
with sufficient flexibility to create their own. This flexibility was a major
step forward, considering that device modelling tools used to be closed and
proprietary.

■ Integrated: TECAP2 has the ability to characterize a device, extract a
model, and then graphically analyze and compare the simulated result
with the data measured on different geometries.

The productivity gain pioneered by the Hewlett-Packard team has formed
the basis for tackling the current 90 nanometer process challenges. Many current
models use more than 150 parameters. Over the past 10 years, the full process
database has grown from 50KB to 50MB, and the number of experimental
devices to be characterized has increased from 5 to more than 30. The inner
loop of the overall modelling process requires an average work phase of six
weeks. In order to adapt to process technology variations, modelling engineers
need the flexibility to modify and extend model parameters beyond those offered
by standard models. To control these variations, state-of-the-art tools
provide device designers and process engineers with accurate models and
statistical analysis capabilities, helping them to determine nominal performance
as well as corner cases.

Modern device modelling software, such as UTMOST from Silvaco [11] or
TECAP2’s next generation, IC-CAP (Integrated Circuit - Characterization and
Analysis Program) from Agilent [12], demonstrates more capabilities as well
as better performance. Data acquisition, instrument control, parameter
extraction, graphical analysis, simulation, optimization, powerful
characterization, and statistical analysis are integrated into a single software
environment.
TECAP2 was the foundation of a new generation of modelling software that
has turned what was previously a highly skilled, hand-crafted process into
high-quality production software.

3. Optimal Transistor Sizing for Digital Circuits 

The second paper in this section, TILOS: A Posynomial Programming
Approach to Transistor Sizing by Jack Fishburn and Al Dunlop, appeared in ICCAD
1985 [2]. The seventh paper in this section, Optimization of Custom MOS Circuits
by Transistor Sizing by Andrew R. Conn, Paula K. Coulman, Ruud A. Haring,
Gregory L. Morrill, and Chandu Visweswariah, appeared in ICCAD 1996 [7].
Both of these papers deal with finding digital device sizes that optimize a
number of circuit characteristics, including timing, power, and area. However,
they use two different approaches: TILOS uses an equation-based approach, while
JiffyTune uses a simulation-based approach. In the equation-based approach, a set
of equations models the circuit’s behavior. These equations can either be derived
automatically or hand-crafted. In the simulation-based approach, the circuit’s
behavior is obtained through numerical simulation.

Complexity and time-to-market pressures urge design teams to use high-level
languages and synthesis tools in order to incorporate more and more
functionality on the same silicon and reach System-on-Chip complexities. This
traditional design flow improves design entry productivity. However, a final
solution can only be found provided that some critical functions are optimized
in terms of size, performance and power. In order to meet their specifications,
dedicated custom circuit design tools are required. New process technologies
accelerate integration, which increases the size and the complexity of these
critical functions. Custom blocks consist of hundreds of transistors, and the
effort needed to implement an optimal solution is not negligible.

The TILOS [2] paper included in this commemorative selection was the
pioneering research that changed all that. Even though previous attempts had
been made to optimize transistor sizes to meet timing requirements [13, 14,
15, 16, 17, 18, 19], this paper provided a technique to cast the problem into
a convex posynomial form that ensures that we can find the global minimum,
i.e., a globally optimal solution. Convex programs have the property that any
local optimum is certain to be globally optimal, which is not the case for
non-convex forms [20]. Consequently, non-convex problems pose major hurdles for
gradient-based schemes, such as getting trapped in local minima, combinatorial
(NP-complete) complexity, and non-linearity with multiple minima in feasible
solution spaces [21].

Nearly 15 years after the TILOS publication, the posynomial programming
approach to transistor sizing remains the cornerstone of the equation-based
optimization approach. Even though the standard geometric program form is
non-convex, it can be remodelled into a convex form by a simple change of
variables: each x_i is replaced with exp(z_i), so that z_i is the natural
logarithm of x_i. As a result, the posynomial form allows for efficient
geometric programming techniques for circuit sizing and optimization, thus
widening the influence of the method. Should circuit characteristics not be
castable into a posynomial function, equations can be approximated by some
posynomial format, and this is a one-time manual effort.
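
The effect of the change of variables can be checked numerically on a toy
example. The Python sketch below uses a hypothetical one-variable posynomial,
f(x) = x^0.5 + x^-1, which is not taken from the TILOS paper, and tests the
midpoint inequality f((a+b)/2) <= (f(a)+f(b))/2 once in the original variable x
and once in the transformed variable z = ln x. A single pair of points cannot
prove convexity; it only shows that the inequality fails in x and holds in z
for this pair.

```python
import math


def posynomial(x):
    """Toy posynomial f(x) = x**0.5 + x**-1 for x > 0 (illustrative only)."""
    return math.sqrt(x) + 1.0 / x


def in_log_domain(z):
    """The same function after substituting x = exp(z)."""
    return posynomial(math.exp(z))


def midpoint_convex(f, a, b):
    """Check the midpoint convexity inequality for one pair of points."""
    return f(0.5 * (a + b)) <= 0.5 * (f(a) + f(b))


if __name__ == "__main__":
    a, b = 4.0, 16.0
    print("midpoint inequality in x:", midpoint_convex(posynomial, a, b))
    print("midpoint inequality in z:",
          midpoint_convex(in_log_domain, math.log(a), math.log(b)))
```

In the z domain each term becomes the exponential of a linear function of z,
and a positive sum of such terms is convex, which is why the transformed
problem can be handed to a convex solver.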

Today’s transistor sizing optimization algorithms cannot focus on delay
alone, independent of power consumption. Power consumption is currently
one of the most significant challenges designers have to face. To tackle this
problem, system level techniques such as low power modes and gated clock
techniques are widely used. However, the situation is desperate enough to take
drastic measures like special processes and very low voltages. In the future,
custom low power circuits should play a major role in optimizing the trade-off
between power and performance [22, 23, 24].

The TILOS work has provided the VLSI industry with a solid base for transis- 
tor sizing to meet timing and other requirements. This single paper showed that 
it was possible to cast even very large problems using an elegant and powerful 
device sizing model using geometric programming techniques. It is not a stretch
to argue that TILOS revolutionized the way we treat large digital netlists, paving 
the way for today’s large-scale convex-solver device sizing tools. 

The seventh paper in this section, Optimization of Custom MOS Circuits by
Transistor Sizing by Andrew R. Conn, Paula K. Coulman, Ruud A. Haring,
Gregory L. Morrill, and Chandu Visweswariah, appeared in ICCAD 1996 [7]. A well
optimized circuit provides excellent leverage against the competition as long as
it does not affect the time-to-market. Therefore, the industry needs to develop
more tools in this field in order to help designers improve their productivity.
By combining a fast circuit simulator, SPECS, with a nonlinear optimizer,
LANCELOT, JiffyTune is the bedrock of this new technology.

The JiffyTune system architecture integrates all functions necessary to
automatically design and optimize hardware blocks within a single software
environment. Even though the JiffyTune interface still requires an experienced
user, it hides the complexities of custom-designed circuit optimization and
offers an interesting path for the future evolution of the tool.

The JiffyTune technology tackles 3 major problems encountered during cus- 
tom circuit design: 



■ Productivity: As per Table 2 in [7], the result of a “COLD” run (started
from an untuned circuit) is very similar to that of manually tuned circuits.
Those results are very encouraging and illustrate the capacity of the tool,
i.e., no a priori hand tuning is required.

■ Flexibility: For various reasons, specifications and technology processes
change during the design phase, affecting circuit optimization. Such changes
could force custom designers to restart their manual work from scratch,
whereas JiffyTune users can minimize the cost by rerunning the transistor
size optimization after the tool’s parameters and input specifications have
been updated.

■ Accessibility: The tool’s power along with the user interface allows de- 
signers to work with limited experience and rapidly obtain an excellent 
result. 

The quality and productivity gains demonstrated by JiffyTune pioneered a
flexible and efficient custom-design circuit methodology that can be targeted to
multiple foundries and processes.

4. Design and Synthesis of Analog Circuits 

The third paper in this section of the commemorative selection is Automatic
Synthesis of Operational Amplifiers Based on Analytic Circuit Models by Han
Young Koh, Carlo H. Sequin and Paul R. Gray from the University of
California, Berkeley. It was published in ICCAD in 1987 [3]. This paper provides
the first description of the OPASYN op-amp synthesis tool. The fifth paper in
our analog and digital circuit design commemorative selection is Analog
Circuit Synthesis for Performance in OASYS by Ramesh Harjani, Rob A. Rutenbar
and L. Richard Carley from Carnegie Mellon University. It was published in
ICCAD in 1988 [5]. The high performance design capabilities of the OASYS
analog circuit synthesis system, and their impact on analog circuit synthesis,
were first described in this paper.

By the mid to late eighties, CMOS feature sizes had shrunk to below 3 μm and
large digital chips were routinely synthesized from high level hardware
description languages. As feature sizes continued to shrink, it was becoming
amply evident that the analog portions of mixed-signal systems on a chip (SoC)
were taking significantly more time to design, even if they involved fewer
devices and occupied a smaller percentage of the chip. In fact, anecdotal
evidence from AT&T Bell Labs included many examples of mixed-signal SoCs where
the analog portion, less than 10% of the chip, took more than 90% of the design
time [25]. The digital design process was significantly more formalized and
followed a strict hierarchy, which made it possible to ‘automatically’ synthesize
such large circuits. Digital circuits are routinely designed hierarchically
through HDL synthesis, gate-level synthesis and back-end physical design.
Additionally, timing closure is attempted at various steps in the process.

However, at that time, and to a large extent even at the current time, the analog
portions of such SoCs are largely still designed by hand: a designer first
generates a circuit schematic with device sizes, which is then followed by a
number of SPICE and ‘tweak’ steps. As a result, the analog portions of SoCs
tend to be the limiting factor in the design process of such systems.
Time-to-market pressures and the desire for working first silicon have motivated
the research into analog circuit synthesis systems. However, unlike in digital
circuits, the strong interaction between device and process characteristics has
not allowed for an easy decoupling of the various design steps. The relative
maturity of digital CAD techniques and the relative independence of digital
design and fabrication process parameters suggest that, whenever possible,
analog circuits should be replaced by their digital counterparts. However, given
that we have to interface with the analog external world, SoCs will always
contain some analog and mixed-signal circuits, continuing to motivate the need
for CAD for analog circuits.

The overall analog circuit design flow can roughly be broken up into three 
major steps: a) topology selection or schematic synthesis, b) device sizing and c) 
layout or physical design. Research has focused on all three aspects of the design 
flow. The set of papers in this section of this commemorative selection focuses 
on the first two steps of the design flow. However, significant progress has also 
been made in the physical design aspects of analog circuits [26, 27, 28, 29, 30]. 

The various approaches that have been used to tackle the analog circuit syn- 
thesis problem can roughly be classified into two groups. 

■ Knowledge-based methods: Early in the analog CAD research process 
it was realized that analog circuit design was different from the digital 
design process. In fact, analog and RF design has often been referred 
to as a ‘black art’ where only a handful of expert designers are success- 
ful at generating high performance circuits. So, some of the early analog 
circuit synthesis research has focused on roughly mirroring the ‘expert de- 
sign flow’. Analog designers use simplified equations for known circuit 
blocks and follow a simple design plan. Some of the earliest successful 
research using this approach includes [31, 32, 5, 33, 34, 35, 36, 37, 38].
These design methods have been applied to a variety of analog and RF 
circuits and generally provide results that are comparable to hand-crafted 
designs. Unlike a number of numerical optimization techniques discussed 
next, knowledge-based techniques have been used for both topology se- 
lection and for device sizing [32]. However, the primary weakness of this 
approach lies in the coding of the design knowledge. Symbolic analysis
techniques were later developed in an attempt to automatically generate
the analytic equations [39]. Among these, the OASYS system [5]
included in this section is one of the earliest pieces of research in this
area. The OASYS system uses a strict hierarchy spanning the system,
circuit, block and component levels. Measurements from fabricated circuits
using designs generated by the OASYS system have been shown to match
predicted values and have shown performance comparable to hand-crafted
designs [32]. The ICCAD paper [5] showed that it was possible to
synthesize high performance analog designs.

■ Numerical optimization-based methods: A number of traditional numer- 
ical techniques, such as steepest descent and minmax formulations, have 
been used to tackle analog circuit design. Most of them have focussed on 
the device sizing problem including the first paper in this section of the 
commemorative selection [3, 40, 41, 42, 43, 44, 45]. OPASYN [3] uses 
simplified analytical models to model circuit performance. The program 
selects an op-amp topology and generates device sizes and bias currents 
to meet a given set of design specifications [3]. The use of analytical 
models to represent behavior, including large signal parameters such as 
slew-rate, ensures that the optimization process is extremely fast. However, 
the OPASYN system, as with other approaches based on traditional numerical 
techniques, can get ‘stuck’ in local optima and does not always provide 
a solution even if there is one. To circumvent this problem, variations 
of classical numerical techniques have been developed, including branch 
and bound [46], simulated annealing [47, 48] and geometric program- 
ming [49]. Interestingly, the branch and bound technique used in [46] 
is used for topology selection rather than for device sizing. Simulated 
annealing is another popular technique that in principle avoids local min- 
ima by initially accepting solutions that aren’t necessarily better at each 
program step. However, simulated annealing does not guarantee a glob- 
ally optimal solution nor can it guarantee that there is no solution when 
one is not found. The geometric programming method developed in [49] 
casts the analog device sizing problem using posynomial functions that 
are convex and can be solved quickly and globally. The primary problem 
that remains is to now cast the design problem into posynomial form. 
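The acceptance rule described above for simulated annealing can be sketched as follows. This is a minimal illustration in Python and not the algorithm of [47, 48]: the cost function `toy_sizing_cost`, the single width variable, and the cooling schedule are all assumptions chosen for brevity.

```python
import math
import random


def toy_sizing_cost(width_um):
    """Hypothetical multimodal cost standing in for a spec-violation measure."""
    return (width_um - 3.0) ** 2 + 2.0 * math.sin(5.0 * width_um)


def anneal(cost, start, step=0.5, temperature=2.0, cooling=0.995,
           iterations=3000, seed=0):
    """Return (best_point, best_cost) found by simulated annealing."""
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    for _ in range(iterations):
        candidate = current + rng.uniform(-step, step)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept a worse point with probability
        # exp(-delta / T), which is large early on and shrinks as T cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temperature = max(temperature * cooling, 1e-9)
    return best, best_cost


if __name__ == "__main__":
    width, value = anneal(toy_sizing_cost, start=0.0)
    print(f"best width {width:.3f} um, cost {value:.3f}")
```

Because worse points are accepted early on, the search can leave a local minimum, but, as noted above, nothing in the procedure certifies that the point returned is globally optimal.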

Analog circuit design continues to be an active research topic, which suggests 
that the problem is significantly less tractable than digital synthesis. However, 
after almost twenty years of active research it is extremely gratifying to see 
the successful establishment of startup companies that are focused on providing 
analog circuit synthesis tools [50, 51]. 

5. Timing Simulation 

The fourth paper in this section, SPECS2: An Integrated Circuit Timing Simulator 
by Chandramouli Visweswariah and Ronald A. Rohrer, was published at 
ICCAD in 1987 and presented a method for fast timing simulation and modelling [4]. 
This period saw great research thrusts in the direction of developing 
fast simulators that could provide large speedups over SPICE while maintaining 
high accuracy. Like prior efforts such as SPECS [52] and MOTIS [53], 
SPECS2 used discretized device models, but it was able to overcome the spurious 
algorithmic oscillations seen in earlier methods. The Simulation Program 
with Integrated Circuits Emphasis (SPICE) has been the industry standard for 
more than 25 years. Although transistor-level circuit simulators are very accurate, 
they are also very slow. The ever-increasing number of transistors in modern 
circuits is making the simulation of very large circuits impossible due 
to the inherent complexity of the circuit models. 

The publication [4] introduces a different approach to the simulation problem, 
emphasizing that the effectiveness of a simulator should be judged not only by its 
accuracy, but also by its feasibility and its productivity. The Simulation Program 
for Electric Circuits and Systems (SPECS) is less accurate than SPICE, to within 
+/- 5 percent, but is about 70 times faster. This speed in itself enables 
the simulation of bigger circuits. Yet the SPECS simulator offers more 
than simulation acceleration. It is also a different way of thinking, a new 
approach showing that simulation accuracy can be traded off against 
simulation speed, especially for MOS circuits. 

Further technologies have derived from this precursor work. By combining 
switch-level and event-driven techniques, simulators such as IRSIM from 
Berkeley have been found to be 1,000 times faster than SPICE while demonstrating 
reasonable accuracy in delay and power estimation for MOS circuits. 

More than 10 years after SPECS2, CAD companies launched a new generation 
of simulators that maintain the spirit of SPECS2. Cadence introduced its 
Affirma Accelerated Transistor-level Simulator (ATS) in June 1999, and Mentor 
Graphics introduced Mach TA in June 2000. The current market leaders 
include Nassda with HSIM and Synopsys, which announced Nanosim in May 
2001. 

Netlists of up to several tens of millions of transistors can be handled 
and simulated more than 1,000 times faster than with SPICE, at an accuracy 
trade-off of just a few percent. These simulators provide low-power 
designers with the information they need to optimize the performance 
of their designs. Mixed-signal netlists and memories are supported, 
delivering high-speed, high-capacity verification for complex chip designs. 
Lookup table and model-in techniques have improved the third generation 
of transistor level simulators. 

The SPECS2 work has had a significant impact on the state of simulation, 
and was part of a wide range of efforts during and after that time, including [54] 
and [55]. Its remarkable combination of speed and accuracy has resulted in 
its use in the inner loop of fast circuit optimization tools, such as 
those in [7] and [56]. The practical impact 
of this work is visible in its widespread use in industry, particularly at IBM. In 
1987, Visweswariah and Rohrer led the way for this new generation of simulators 
as the precursors of a different approach [57, 58, 59, 60, 61, 62, 63, 64]. 



6. Extracting Gate-Level Models 

The sixth paper in this section, Extraction of Gate Level Models from Transistor 
Circuits by Four-Valued Symbolic Analysis, appeared in ICCAD 1991 [6]. 
Every year, systems-on-chip have become increasingly complex while the 
associated number of transistors has grown exponentially. Consequently, 
traditional transistor simulators were not able to cope with the increased 
complexity and reached the limit of their capability. Although electrical 
simulators have dramatically improved their simulation time, they did not catch up 
with the level of complexity and thus remained limited. In order to simulate 
a complete system-on-chip at transistor level, the industry was anticipating 
the development of a new technology. This publication appeared as the 
cornerstone of new activities that have changed the full functional verification 
flow [65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75]. 

Today, the transistor abstraction technique is based upon two different methods: 
pattern matching and the switch-level symbolic analysis described in the 
publication. One of the main advantages of pattern matching is 
that both digital and analogue designs can be extracted. However, its 
efficiency depends entirely on the quality of the associated library. The switch-level 
symbolic analysis is very effective because binary decision diagrams are used 
to represent and manipulate the four-valued functions. It can handle any 
combinational digital design in a way that is totally independent of design flow 
and design style, which is one of the strongest assets of the technique. 

Based on the technology described in the paper, the “laybool” tool for 
transistor abstraction was developed in 1994 [71]. The tool combined 
switch-level analysis with other techniques in order to deal with more 
complex circuits. After the first switch-level analysis was carried out, new 
algorithms were implemented to handle flip-flops, latches, arithmetic operators, RAM, 
dynamic logic, precharged logic, etc. 

The transistor abstraction technology is now integrated within the classic 
functional verification flow, promoting RTL sign-off. Designers first extract 
a gate-level description from the transistor netlist. Using equivalence 
checking technology, the abstracted netlist is then matched against the 
RTL model. This method fully certifies the equivalence of the layout view with the 
RTL model. The same abstracted gate-level netlist can be ported to a hardware 
accelerator or emulator to test the reset mechanism, and the approach makes the 
extraction of complex chips, such as a complete 64-bit processor, possible. 

Time-to-market continues to put pressure on development teams. IP 
reuse is one of the solutions the design community has found to improve its 
productivity. A top-level model simulated early in the design cycle is a key factor 
in reducing development time. Despite corporations’ efforts to normalize 
IP interfaces, only blocks developed recently abide by these strict rules, which 
increase the design effort but facilitate reuse. IP blocks developed several 
years ago are part of a company’s assets even though they sometimes 
offer only a layout view. The transistor abstraction technique enables a 
gate-level netlist to be created, so that all of the legacy blocks can be integrated 
within the new design flow. 

The laybool tool is currently commercialized as Lynx-LB by Avant!. Since 
then, other companies such as Verplex (“transformal-LTX”) and Avertec (“Yagle”, 
derived from a PARIS/LIP6 work) [75], based on the switch-level analysis, 
have developed similar tools. TNI-Valiosys has launched “TLL”, combining 
pattern matching and switch-level analysis techniques. Notwithstanding the 
fact that transistor abstraction tools are integrated in most modern tool flows, 
they are still not a push-button technology, as they may need to be adapted to 
different design styles. Still, Randal E. Bryant, with his paper [6], is a pioneer of 
a technology that has significantly impacted the microelectronic industry. 

7. Conclusions and Acknowledgements 

When ICCAD began in 1983, we had no robust tools for device modelling, 
analog synthesis, electrical timing simulation, transistor-to-logic abstraction, or 
large-scale custom circuit tuning. Today, all these techniques are in common 
industrial usage, most commercially available. The seven papers in this section 
fundamentally transformed the way in which we model, manipulate, and solve 
these difficult tasks today. They are all linked by a common thread of “deep 
circuits innovation”. 

The authors wish to acknowledge the valuable discussions and input from 
various friends and colleagues, and in particular the help provided by Prof. Sachin 
Sapatnekar of the University of Minnesota. 

Abstract 

A computer aided design system for integrated circuits is not complete without accurate circuit 
simulation capabilities. Accurate simulation is not possible without accurate models and model 
parameters. TECAP2 (Transistor Electrical Characterization and Analysis Program) is a pro- 
gram that greatly simplifies the development of models, the extraction of model parameters from 
measured data, and the comparison of the effectiveness of various models. 



1. Introduction 

The integrated circuit designer relies on circuit simulation to verify circuit 
performance. Circuit simulation accuracy is directly affected by transistor mod- 
eling accuracy. Transistor models are made up of mathematical equations that 
contain constants that are fixed for a given fabrication process; these constants 
are called model parameters. The actual values of the model parameters, as 
well as the model equations, determine the extent to which a model can accu- 
rately predict device electrical characteristics over a range of device geometries. 
TECAP2 provides an environment for developing accurate models and for 
extracting model parameter values from measured data. The TECAP2 system 
evolved from the original TECAP [1], which was developed at Stanford University 
in the late 1970s. 

2. TECAP2 SYSTEM 
2.1 User Interaction 

TECAP2 was designed with the objective of providing ease of use for the 
occasional user, as well as providing advanced features and capabilities for the 
expert user. A simple menu structure has been implemented that avoids the 
problem of getting lost in menu hierarchies. The TECAP2 system commands 
are organized into several subsets, as shown in Figure 1. The major topics of 
interest are displayed in the main menu and subtopics are displayed in the subset 
menu. The subset menu changes as the user steps through the main menu topics 
by typing the letter of the command. 

Subset 
    S)    S1,S7 
    S1)   Simulate 
    S2)   Int.act.sim. 
    S4)   Select model 
    S5)   Enter parameters 
    S7)   Plot data + axis 
    S8)   Plot data only 
    S9)   Print data 
    S10)  Plot 2nd output 
    S11)  Plot 3rd output 
    S12)  Save data 

TECAP2 Main Menu 
    D)    Device data 
    C)    Connections 
    U)    Use setup 
    M)    Measure 
    E)    Extract 
    S)    Simulate 
    O)    Output control 
    P)    Plot control 
    F)    Filer 
    B)    Build Setup 
    I)    Input sequence 
    Q)    use seQuence 
    A)    Auxiliary 

Figure 1. TECAP2 command menu. 

The user prepares the system for a task (such as measurement, simulation, or 
extraction) by selecting the appropriate command. If information is needed to 
perform the task (such as start, stop, and increment values), the user is supplied 
with a table or diagram to edit. Every table or diagram has a set of default values 
or configurations. The table entry method allows the user to have random access 
to the specifications of the task. Typical menu driven systems require the user 
to serially dredge their way through a series of menus, even when only one item 
is to be changed. The complete set of configuration information, for the entire 
system, can be saved or retrieved with a single command. This makes it possible 
for any user to customize the system by configuring it for a specific purpose 
and saving the configuration on a disc file; any other user can then reload the 
configuration and thus also have a customized turnkey system. 

The TECAP2 system can be operated without extensive knowledge of the 
computer hardware, the computer operating system, or the instruments. The 
system automatically adapts to handle whatever instruments are connected to 
the system (for example, the user specifies capacitance measurements in the same way 
regardless of what specific model of capacitance meter is on line). 

The system outputs can be directed to external output devices, such as printers, 
plotters, external monitors, and external graphics displays. Output 
routing is controlled through the same menu commands as the other system 
commands. 

Figure 2. TECAP2 Hardware block diagram. 



2.2 System Hardware 

The TECAP2 system combines the computing power of the HP9000 series 
200 PASCAL workstation with the measurement capabilities of a variety of ad- 
vanced instruments. The TECAP2 measurement hardware can be configured 
either as a benchtop system using the HP4145A Semiconductor Parameter An- 
alyzer for DC measurements and the HP4280A Capacitance Meter for CV char- 
acterization (Figure 2), or as a more sophisticated system using the HP4062A 
Semiconductor Parametric Test System hardware. The HP4062A system fea- 
tures the addition of an automatic connection matrix and a wafer prober. In 
addition, TECAP2 supports the Shared Resource Manager (a shared disc net- 
work) as well as an RS-232 data link to provide a communications interface to 
other systems. 

2.3 System Capabilities 

Measurement. A wide variety of measurements can be performed on a 
given device. Constant and swept stimuli may be applied to any terminal of the 
device under test for DC characterization. The resulting currents and voltages 
at any of the device ports may be monitored. AC characterization is available in 
the form of capacitance versus voltage measurements. 

Simulation. Although similar to the SPICE circuit simulator in function, 
the TECAP2 simulator is restricted to a one-transistor simulator, designed to 
simulate the electrical characteristics of a four-terminal device model for com- 
parison with corresponding measured characteristics. The simulator is device 
model independent and works equally well for MOS and Bipolar type device 
models. The model structure used in TECAP2 is the same as that used by HP 
Modular SPICE [2]. This model structure greatly simplifies the task of imple- 
menting new models by consolidating all of the model routines. The TECAP2 
simulator will produce single transistor results that are identical to those ob- 
tained using a circuit simulator, providing the same model has been imple- 
mented in both TECAP2 and the circuit simulator. 

The simulator takes the same input instructions as the measurement part of 
the program. It is therefore easy to generate simulated device characteristics 
which can be graphically compared to the corresponding measured data. An 
interactive simulation mode allows the user to vary parameters, one at a time, 
and graphically display the resulting simulation in real time. This feature allows 
the user to quickly isolate the effect of an individual parameter on the simulated 
device characteristics. 

Extraction. The TECAP2 parameter extraction capability is model in- 
dependent, allowing use of the same system to model many different types of 
devices. TECAP2 supports a three-step parameter extraction methodology to 
extract device model parameters. This approach generates an accurate set of 
model parameters in a minimum amount of time, while still providing model 
independence. 

The first step in the parameter extraction methodology is to identify the re- 
gions of device operation to be modeled. Measured data can then be collected 
for each of these regions of operation. The model parameters can then be 
grouped into subsets that most directly affect the device characteristics in each 
of the selected regions of operation. The pertinent regions of operation and 
parameter groupings may be selected automatically for the supported models. 
However, the user has the flexibility to change these if desired. 

In the second step, initial estimates for the parameter values are calculated 
directly from the measured data. This step ensures that the final step will pro- 
duce a physically meaningful set of parameters in the shortest time. The initial 
estimates can be calculated automatically for the supported models. However, 
the user has the flexibility to select the initial values as desired. 

Finally, the TECAP2 optimizer is called to determine the parameter values 
that minimize the difference between the measured and simulated device 
characteristics. The TECAP2 optimizer uses a nonlinear least-squares fitting algorithm 
(the Levenberg-Marquardt algorithm [3]) in conjunction with user-specifiable 
constraints for each parameter. 
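The final optimization step can be illustrated with a small Levenberg-Marquardt fit. The sketch below is not TECAP2 code: the two-parameter linear-region drain current equation in `model_id`, the synthetic "measured" data, and the handling of constraints by simple clipping to bounds are all assumptions made to keep the example self-contained.

```python
def model_id(params, vg, vd):
    """Assumed first-order linear-region model: Id = beta*(Vg - Vt - Vd/2)*Vd."""
    beta, vt = params
    return beta * (vg - vt - vd / 2.0) * vd


def synthetic_data(beta, vt, vd=0.1):
    """Hypothetical Id-versus-Vg sweep generated from known parameters."""
    sweep = [1.5 + 0.5 * k for k in range(8)]
    return [(vg, vd, model_id((beta, vt), vg, vd)) for vg in sweep]


def sum_sq(params, data):
    return sum((model_id(params, vg, vd) - i) ** 2 for vg, vd, i in data)


def clip(params, bounds):
    """Enforce per-parameter (low, high) constraints by clipping."""
    return [min(max(p, lo), hi) for p, (lo, hi) in zip(params, bounds)]


def levenberg_marquardt(data, start, bounds, iterations=50, damping=1e-3):
    params = clip(start, bounds)
    cost = sum_sq(params, data)
    for _ in range(iterations):
        beta, vt = params
        a11 = a12 = a22 = g1 = g2 = 0.0
        for vg, vd, i_meas in data:
            r = model_id(params, vg, vd) - i_meas
            j1 = (vg - vt - vd / 2.0) * vd   # d(Id)/d(beta)
            j2 = -beta * vd                  # d(Id)/d(Vt)
            a11 += j1 * j1
            a12 += j1 * j2
            a22 += j2 * j2
            g1 += j1 * r
            g2 += j2 * r
        # Solve (J^T J + damping * diag(J^T J)) * step = -J^T r for 2 unknowns.
        d11, d22 = a11 * (1.0 + damping), a22 * (1.0 + damping)
        det = d11 * d22 - a12 * a12
        if det == 0.0:
            break
        step = ((-g1 * d22 + g2 * a12) / det, (g1 * a12 - g2 * d11) / det)
        trial = clip([params[0] + step[0], params[1] + step[1]], bounds)
        trial_cost = sum_sq(trial, data)
        if trial_cost < cost:    # accept: behave more like Gauss-Newton
            params, cost, damping = trial, trial_cost, damping / 10.0
        else:                    # reject: behave more like gradient descent
            damping *= 10.0
    return params, cost


if __name__ == "__main__":
    measured = synthetic_data(beta=1e-4, vt=0.8)
    fit, err = levenberg_marquardt(measured, [5e-5, 0.5],
                                   [(1e-6, 1e-2), (0.0, 2.0)])
    print(fit, err)
```

The starting point plays the role of the initial estimates from the second step of the methodology; a poor start or tight bounds can leave a fit of this kind in a physically meaningless region, which is presumably why the initial-estimate step matters.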

The models supported by TECAP2 are shown in Figure 3. Although these 
models include the most widely used models at present, new models will continue 
to be developed. Consequently, TECAP2 is designed to provide the user 
with the capability to implement new models and extraction modules. 

Model Name              Description 
HP-MOS                  HP internal MOSFET model in HPSPICE [3] 
UCB-MOS (UCB1, 2, 3)    3 levels of MOS model in Berkeley’s SPICE [4] 
CSIM                    New analytical/empirical model in SPICE [5] 
CLASSICAL-MOS           A simple first order MOSFET model 
HPSPICE-BJT             Gummel-Poon Bipolar model in HPSPICE/SPICE 
HPSPICE-DIODE           The Diode model in HPSPICE/Berkeley’s SPICE 
PN-CAP                  P-N junction capacitance model 
MOS-CAP                 MOSFET gate structure capacitance model 

Figure 3. Available device models in TECAP2. 

2.4 User Enhancements 

New user-defined commands which customize the system for special func- 
tions can be implemented via a user-linkable module. These commands will ap- 
pear in a TECAP2 command menu and may be used with the control structure 
like any other TECAP2 command. User defined models are also implemented 
through this module. A new model, when implemented, will automatically have 
the same simulation capability as TECAP2 supported models. Parameter ex- 
traction using the optimizer and the interactive simulation is also immediately 
available for new models. A fully automated extraction capability can be easily 
developed by creating a custom extraction module for the new model. 

3. TECAP2 APPLICATIONS 

The TECAP2 system was used to extract parameter values for five different 
models, in order to compare model performance. N-channel test devices ranging 
in masked channel length and width from 10.4 microns to 1.4 microns were 
characterized. The channel length reduction, LD, (due to processing effects) 
was about 0.35 micron and the channel width reduction WD was 0.55 micron 
resulting in an effective minimum geometry device of 0.3 by 0.7 micron. The 
objective of the parameter extraction was to obtain a single set of parameters for 
each model that would give the best possible fit over a range of different device 
geometries. The pertinent parameters are listed by functional category in Figure 
4. Note that each model may also support other optional parameters that were 
not used here. 
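The effective minimum geometry quoted above can be reproduced from the stated numbers if the reductions LD and WD are taken to apply on each side of the channel. That factor of two is an inference from the figures rather than something the text states, and the subscripted symbols below are introduced only for this illustration:

```latex
\begin{aligned}
L_{\mathrm{eff}} &= L_{\mathrm{mask}} - 2\,LD = 1.4 - 2(0.35) = 0.7\ \mu\mathrm{m},\\
W_{\mathrm{eff}} &= W_{\mathrm{mask}} - 2\,WD = 1.4 - 2(0.55) = 0.3\ \mu\mathrm{m}.
\end{aligned}
```

Under this reading, the "0.3 by 0.7 micron" device is 0.3 micron wide and 0.7 micron long.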



Model                       HP-MOS    UCB1      UCB2      UCB3      CSIM 

Basic Parameters            UO        UO        UO        UO        UN 
                            VTO       VTO       VTO       VTO       VFB 
                            NSUB      GAMMA     NSUB      NSUB      PHI 
                                      PHI                           K1 
                                                                    BETA{1} 

Mobility Reduction          VNORM               UEXP      THETA     UO 
                            ETRA                                    UOJ 

Sub-threshold                                   NFS       NFS 

Channel Width Effects       WD        WD{2}     WD{2}     WD{2}     WD{3} 
                            VWFF                DELTA     DELTA 
                            WFF 

Channel Length Effects      LD        LD        LD        LD        LD{3} 
                            VDFF                XJ        XJ 
                            LFF 

External Resistances        RS        RS        RS        RS 
                            RD        RD        RD        RD 

Velocity Saturation         ECRIT               VMAX      VMAX      U1{4} 

Channel Length Modulation   DESAT     LAMBDA    NEFF      ETTA      K2 
                                                          KAPPA     ETA{5} 

Scale with L                Good      No        Good      Good      Good 
Scale with W                Good      No        Add WD    Add WD    Good 
Sub-threshold               No        No        Good      Good      No 
Velocity Sat                Good      No        OK        Good      Good 
Speed                       88 ms     17 ms     180 ms    71 ms     29 ms 
RMS Error                   2.5%      8%        2.0%      5.2%      1.8% 
Max Error                   7.3%      17%       6.2%      6.8%      5.1% 

{1} Represented by five parameters. 
{2} Added to the TECAP2 implementation. 
{3} 34 parameters model width/length effects for CSIM. 
{4} Represented by 3 parameters. 
{5} Represented by 3 parameters. 

Figure 4. Functional grouping of model parameters. 



3.1 Parameter Extraction Process 

The model parameters were extracted using a total of four measurements (six 
for CSIM) on three different size devices. Each measurement typically consists 
of 20 to 100 data points. An average measurement and extraction of a complete 
set of model parameters takes less than five minutes. 

Figure 5. Comparison of the measured and simulated data after extraction of all parameters. 

The “Basic” parameters and the “Mobility Reduction” parameters were first 
extracted using Id versus Vg data measured in the linear region on a large de- 
vice (10.4/5.4). The “Sub-threshold” parameter (if any) was then extracted from 
the low current region of the same measurement. The “Channel width” and 
“Channel length” parameters were extracted using Id versus Vg data on narrow 
and short devices respectively. The “Velocity Saturation” and “Channel Length 
Modulation” parameters were then extracted using an Id versus Vd measurement 
on the shortest channel length device. Figure 5 shows the measured and 
simulated data after final extraction of the HP-MOS model parameters. 

The CSIM model requires a slight modification of the above extraction pro- 
cedure, since scaling of device characteristics with geometry is achieved empir- 
ically. For each model parameter shown in Figure 4, two additional parameters 
are used to describe the variation of parameter value with geometry. These ge- 
ometry dependent parameters are extracted from Id/Vg and Id/Vd measurements 
for three different geometry devices. 
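One way to read the empirical scaling described above is that each CSIM parameter is expanded into a constant plus a length term and a width term. The symbols P_0, P_L and P_W are introduced here only for illustration, and the text does not say whether L and W denote masked or effective dimensions:

```latex
P(L, W) = P_0 + \frac{P_L}{L} + \frac{P_W}{W}
```

With three coefficients per parameter, measurements on at least three devices of different geometry are needed to determine them, which is consistent with the three-device extraction described above.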

3.2 Geometry scaling 

The extraction process described here extracts a complete set of model 
parameters from three different size devices. These parameters are then used to 
simulate other geometries. The simulation results can then be compared to the 
measured data in order to study how each model scales with geometry. Figure 
6 shows the RMS error between the simulated and measured data for different 
channel lengths and widths. The HP-MOS, UCB2 and UCB3 models show 
reasonable accuracy for all geometries except when the channel width is 1.4 
micron (effective width of 0.3 micron). The UCB1 model does not perform 
well, even for a large device. The CSIM model provides excellent accuracy near 
the geometries used to extract the parameter values. Larger deviations between 
measured and simulated results are encountered, however, for other geometries. 
This suggests that some parameters do not scale well using a 1/L and 1/W 
geometry dependence. 

Figure 6. RMS error between measured and simulated data on different geometries. 

4. CONCLUSIONS 

A thorough comparison of five different models shows that the HP-MOS, 
UCB2, UCB3 and CSIM models scale down reasonably well to less than one 
micron. More fundamental work however is needed to push these models to 
less than half a micron. Implementation of a new model (CSIM) was accom- 
plished in less than a week, demonstrating the expandability and flexibility of 
the TECAP2 system. All of the measurements, extractions, and comparisons 
presented above were performed in a few hours, showing the effectiveness of 
TECAP2 as a modeling and characterization workstation. 

Abstract 

A new transistor sizing algorithm, which couples synchronous timing analysis with convex 
optimization techniques, is presented. Let A be the sum of transistor sizes, T the longest delay through 
the circuit, and K a positive constant. Using a distributed RC model, each of the following three 
programs is shown to be convex: 1) Minimize A subject to T < K. 2) Minimize T subject to 
A < K. 3) Minimize $AT^K$. The convex equations describing T are a particular class of functions 
called posynomials. Convex programs have many pleasant properties, and chief among these is 
the fact that any point found to be locally optimal is certain to be globally optimal. TILOS (Timed 
LOgic Synthesizer) is a program that sizes transistors in CMOS circuits. Preliminary results of 
TILOS’s transistor sizing algorithm are presented. 



1. Introduction 

Given a synchronous MOS circuit of the form shown in Figure 1 with N 
transistors of sizes (channel widths) $x_1, x_2, \ldots, x_N$, the following question is 
considered: How can the circuit’s performance be improved by adjusting the 
$x_i$? Two figures of merit are of special interest. T is defined to be the minimum 
clock period at which the circuit will operate. The other quantity, A, is simply 
the sum of transistor sizes. A is positively correlated with a number of other 
attributes of the circuit that should be minimized or constrained: These include 
silicon area, capacitance-discharge power, short-circuit power, and probability 
of a device failure within a chip. 

TILOS (pronounced tee-los) is a program that requires a transistor connec- 
tivity file and I/O-delay file, and adjusts transistor sizes and connectivity within 
logical gates to meet the user’s requirements for A and/or T. TILOS’s output is 
a transistor connectivity file that can be passed to a layout program such as SC2 
[5]. 

TILOS contains a static timing analyzer which recognizes latches and thus is 
capable of extracting all relevant timing paths from a circuit of the form shown 
in Figure 1. This recognition process is similar to that used in other static timing 
analyzers [1] [10] [6], and will not be further described here. 

Figure 1. Memory/combinational-logic model of digital MOS circuits. A path begins at the 
output of a latch or an input, and ends at the input of a latch or an output. 

Several authors ([13] [3] [4] [7] [9]) have reported on optimization techniques 
for transistor sizing. Contributions of this work over previous approaches in- 
clude: 1) Using a distributed RC model of delay, T is proved to be a convex func- 
tion of the transistor sizes. T remains convex if wire widths are also considered 
to be variables. 2) TILOS couples static timing analysis with transistor sizing, 
relieving the user of the need to specify which paths are to be optimized. Rather, 
the user specifies desired behavior of I/O signals, including clocks, and TILOS 
determines what paths are in need of improvement. 3) Transistors in latches as 
well as combinational gates are sized. 4) TILOS sorts series-connected subnet- 
works in a complex gate so that the subnetworks with earlier-arriving inputs are 
closer to the power supply. This heuristic allows transistor sizing a chance to 
operate: A transistor with an earlier-arriving input can be made larger and still 
have time to turn on, despite the increased gate capacitance, before other inputs 
arrive, providing a lower-resistance path to power. In addition, the 
increased source and drain capacitance of the larger transistor helps, rather than 
hinders, the output transition. 

2. Three Formulations of the Transistor Sizing Problem 

Let K be a positive constant, and let A and T be as defined above. Consider 
the following three optimization programs: 

1. Minimize A subject to the constraint T < K. 

2. Minimize T subject to the constraint A < K. 

3. Minimize $AT^K$. 

The first formulation can be used when the circuit must fit inside a system 
with a given clock period K. The second formulation might be used when the 
circuit is to be made as fast as possible, subject to constraints on silicon area, 
power, or yield. The third formulation represents a holistic approach in which 
both A and T are important, but have relative weights assigned to them. 

3. MOSFET Model 

The MOSFET model that TILOS uses is shown in Figure 2. The gate, source, 
and drain capacitances are all proportional to the transistor size X, and the 
source-to-drain resistance is inversely proportional to X. 




Figure 2. The source, drain, and gate capacitances are proportional to transistor size. The 
effective resistance is inversely proportional to transistor size. 



4. RC Delay Model 

Figure 3 illustrates the modeling of gate delay by a distributed RC network 
[1 1] [8]. In the RC network shown, an upper bound for the discharge time is 

(/?1 +/?2)C2 + (Ri +R2 + ^s)C3. (1) 



This represents a much tighter bound than a lumped R and C model. 

Figure 3. Delay through pulldown network of NAND gate is modeled with RC network.

With the MOSFET and RC-delay models, the delay through any series
configuration of transistors can be expressed in terms of the transistor sizes and
routing parasitics. The important point here is the form of (1) when expressed
as a function of the transistor sizes $x_i$: each $R_i$ is proportional to $1/x_i$, and each
$C_i$ is some constant (for wire capacitance) plus one term for each transistor whose
gate, drain or source is connected to the node. This term is proportional to the
transistor size. Thus (1) can be rewritten as:

$$(A/x_1 + A/x_2)(Bx_2 + Cx_3 + D) + (A/x_1 + A/x_2 + A/x_3)(Bx_3 + E) \tag{2}$$

where A, B, and C are constant coefficients for resistance, drain and source
capacitance, respectively, and D and E are wire capacitances. It is interesting
to note that if wire widths, as well as transistor widths, are treated as variables,
then the expression for delay remains in the same form as (2).
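
As a concrete reading of (2), the sketch below evaluates the delay of the three-transistor pulldown chain from the transistor sizes. The coefficient values are made-up placeholders, not data from TILOS.

```python
def pulldown_delay(x1, x2, x3, A, B, C, D, E):
    """Evaluate expression (2) for transistor sizes x1..x3.

    A: resistance coefficient, B: drain capacitance coefficient,
    C: source capacitance coefficient, D and E: wire capacitances.
    """
    r1, r2, r3 = A / x1, A / x2, A / x3   # resistance is inversely proportional to size
    c2 = B * x2 + C * x3 + D              # internal node capacitance
    c3 = B * x3 + E                       # output node capacitance
    return (r1 + r2) * c2 + (r1 + r2 + r3) * c3


# Placeholder coefficients (arbitrary units).
coeffs = dict(A=1.0, B=1.0, C=1.0, D=2.0, E=4.0)
print(pulldown_delay(1.0, 1.0, 1.0, **coeffs))  # 2*4 + 3*5     = 23.0
print(pulldown_delay(2.0, 2.0, 2.0, **coeffs))  # 1*6 + 1.5*6   = 15.0
```

Doubling every size halves each resistance but less than doubles each node capacitance, because the wire terms D and E do not grow.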



5. Delay Through Complex Gate 

For each transistor in a pulldown or pullup network of a complex gate, the 
greatest resistance path from the drain to the gate output is computed, as well 
as the greatest resistance path from the source to a supply rail. Thus for each 
transistor, the network is transformed into an equivalent series configuration, 
and the calculation of the previous section can be applied. 

6. Delay Through a Single Circuit Path 

Every circuit path has two path delays: one for the case where the input to
the path is rising, the other for the falling input. Since a path delay is simply a
sum of gate delays as in (2), the general form of a path delay is as follows:

$$\sum_{i,j=1}^{N} a_{ij}\,\frac{x_i}{x_j} + \sum_{i=1}^{N} \frac{b_i}{x_i} \tag{3}$$

where the $a_{ij}$ and $b_i$ are nonnegative constants that are mostly zero. The function
which (3) describes is convex, which means that any straight line segment in
(N+1)-dimensional space whose endpoints lie in the graph of the function is itself
entirely on or above the graph.

7. Delay Through Entire Circuit Is a Convex
Function of the $x_i$

The delay T through a single combinational block is defined as the maximum, 
over all possible paths through the block, of expressions of the form (3). Since 
convexity is preserved under sums and maxima, T is thus a convex function of 
the variables $x_i$. As a consequence, all three formulations considered in section
2 are the minimization of a convex function subject to an upper bound constraint 
on another convex function. This implies that any point found to locally min- 
imize the objective function also globally minimizes it. The field of Convex 
Programming [12] has been intensively explored over the last several decades, 
and many techniques are available. In addition, (3) belongs to a special class of 
functions called posynomials, and the more specialized techniques of Geomet- 
ric Programming [2] can be used. Techniques such as simulated annealing or 
multiple random starts are unnecessary. 
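
A standard way to make the posynomial structure explicit, stated here as background from geometric programming rather than as part of the original text, is the change of variables $x_i = e^{z_i}$. Each term of (3) then becomes an exponential of a linear function of the $z_i$:

```latex
\[
a_{ij}\,\frac{x_i}{x_j} = a_{ij}\,e^{z_i - z_j},
\qquad
\frac{b_i}{x_i} = b_i\,e^{-z_i},
\qquad a_{ij},\, b_i \ge 0 .
\]
```

Each such term is convex in $z$, and nonnegative sums and pointwise maxima of convex functions are convex. Both the path delay (3) and the block delay T are therefore convex in the transformed variables, so a local minimum in $z$ is also a global one.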

8. Tailoring Convex Programming Techniques to 
Transistor Sizing 

Since we are interested in applying the sizing algorithm to circuits as large 
as an entire VLSI chip or perhaps to an entire system, the algorithm must be as 
efficient as possible, perhaps at the expense of absolute accuracy in finding the 
optimum. In our experience, the algorithm described below provides an efficient 
method for converging to the optimum point. 

TILOS proceeds as follows. Starting with minimum sizes for all transistors,
a static timing analysis is performed on the circuit, which assigns two numbers
to each electrical node: $t_l$ (latest time to go low) and $t_h$ (latest time to go high).
From each path output that fails to meet its timing goals for $t_l$ or $t_h$, TILOS
walks backward along the failing path. Whenever a node X is visited, TILOS
examines in turn each NFET (if X’s $t_l$ is failing) or PFET (if X’s $t_h$ is failing)
which could have an effect on the path. In general, this includes both the critical
transistor (the transistor whose input is on the critical path), and supporting
transistors (transistors along the highest-resistance-to-power-supply path from
the source of the critical transistor). TILOS calculates the sensitivity of each
such transistor i, which is the time savings accruing per increment of $x_i$. The
size of the transistor with the largest sensitivity is increased, and the process is
repeated.
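
The loop described above can be sketched on a toy circuit. The code below is not TILOS: it replaces static timing analysis with a closed-form delay for a single chain of stages, estimates each sensitivity by trial-bumping one size, and uses a made-up 10% bump factor. All names and numbers are assumptions chosen for illustration.

```python
def chain_delay(x, r_drive=1.0, r_u=1.0, c_u=1.0, c_load=50.0):
    """RC delay of a chain of stages; stage i drives the gate of stage i+1."""
    delay = r_drive * c_u * x[0]                 # fixed driver charging the first gate
    for i, size in enumerate(x):
        load = c_u * x[i + 1] if i + 1 < len(x) else c_load
        delay += (r_u / size) * load             # resistance ~ 1/size
    return delay


def greedy_size(n_stages, target, bump=1.1, max_iter=10_000):
    """Start at minimum size; repeatedly enlarge the most sensitive stage."""
    x = [1.0] * n_stages
    for _ in range(max_iter):
        current = chain_delay(x)
        if current <= target:
            break                                # timing constraint met
        best_index, best_sensitivity = None, 0.0
        for i in range(n_stages):
            trial = list(x)
            trial[i] *= bump
            # time saved per unit of added size
            sensitivity = (current - chain_delay(trial)) / (trial[i] - x[i])
            if sensitivity > best_sensitivity:
                best_index, best_sensitivity = i, sensitivity
        if best_index is None:
            break                                # past the minimum: no bump helps
        x[best_index] *= bump
    return x


sizes = greedy_size(n_stages=3, target=12.0)
print(sizes, chain_delay(sizes))
```

Because sizes only ever grow from their minimum values, the loop stops as soon as the target is met and does not spend area beyond that point.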

Figure 4. The critical path extends back along the gate of the top transistor.

Figure 4 illustrates a series configuration in which the critical path extends
back along the gate of the top (critical) transistor. The sensitivity calculation for
this transistor is derived as follows: Fix all transistor sizes except x, the size of
the critical transistor. R is the total resistance of an RC chain driving the gate. C
is the total capacitance of an RC chain being driven by the configuration. Then
the total delay D(x) of the critical path is

$$D(x) = K + R C_u x + \frac{R_u C}{x},$$

where $R_u$ and $C_u$ are resistance and capacitance of a unit-sized FET. K is a
constant that depends on the resistance of the bottom transistor, capacitance in
the driving RC chain, and resistance in the driven RC chain. The sensitivity D'(x)
is then

$$D'(x) = R C_u - \frac{R_u C}{x^2}.$$

The sensitivity calculation for supporting transistors is done in a similar way.
When D'(x) is set equal to zero, the resulting value of x, which minimizes delay,
is equal to a constant times

$$\sqrt{C/R}. \tag{4}$$
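
For reference, the constant in front of (4) follows directly from setting the derivative to zero. The short derivation below uses only the symbols already defined for D(x):

```latex
\[
D'(x) = R\,C_u - \frac{R_u\,C}{x^2} = 0
\;\Longrightarrow\;
x^{*} = \sqrt{\frac{R_u\,C}{R\,C_u}} = \sqrt{\frac{R_u}{C_u}}\,\sqrt{\frac{C}{R}},
\qquad
D(x^{*}) = K + 2\sqrt{R\,C_u\,R_u\,C}.
\]
```

Since $D''(x) = 2R_uC/x^3 > 0$ for $x > 0$, this stationary point is a minimum.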

An interesting consequence of (4) occurs in the special case when all inputs 
are equally critical. Since the quantity C includes capacitance of all sources 
and drains of FETs higher in the series configuration, the transistor sizes that
produce minimal delay are smaller near the output, and larger near the power
supply. This is similar to the “pyramid shaped” NAND gates of Shoji [14].

The sizing process terminates when either the constraints are met, or when the 
circuit has passed its absolute minimum and is getting slower instead of faster. 
Since the number of paths through a circuit can be very large in comparison to 
the size of the circuit itself, the optimization is performed without ever actually 
keeping track of a sensitivity (Lagrange Multiplier) for each critical path, or 
even enumerating all paths. 

9. Some Preliminary Results

TILOS currently consists of 2705 lines of C, Lex, and Yacc source code.
Design started in January, and coding in March of 1985. Recently, sorting of series
subnetworks, recognition of latches, and sizing of transistors within latches were
added. The following table summarizes TILOS’s performance on 4 sample
circuits. The first 3 are 2-, 8-, and 32-bit adders, and the fourth is a standard-cell
finite-state machine with 2 static flip-flops. The table gives the number of
transistors, T (ns) and A (microns) before and after sizing, and the number of
seconds of TILOS execution on a 68000-based workstation.



Speedup Due to Transistor Sizing

Name     FETs    Unsized               Sized                 Ex. sec.
                 T (ns)   A (mic.)     T (ns)   A (mic.)
add2       56       14       210          8       278            6
add8      224       42       840         18       986           49
add32     896      154      3360         58      3609         1424
fsm       180       30       675          8       744           97
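
The table can be read as speedup and area-growth ratios. The short script below recomputes them from the table's own numbers; only the variable and function names are introduced here.

```python
# (name, FETs, T_unsized_ns, A_unsized, T_sized_ns, A_sized, exec_seconds)
rows = [
    ("add2", 56, 14, 210, 8, 278, 6),
    ("add8", 224, 42, 840, 18, 986, 49),
    ("add32", 896, 154, 3360, 58, 3609, 1424),
    ("fsm", 180, 30, 675, 8, 744, 97),
]


def ratios(row):
    """Return (speedup, area growth) for one table row."""
    _, _, t_before, a_before, t_after, a_after, _ = row
    return t_before / t_after, a_after / a_before


for row in rows:
    speedup, area_growth = ratios(row)
    print(f"{row[0]:6s} speedup {speedup:.2f}x, area x{area_growth:.2f}")
```

On these four circuits the delay improves by a factor of 1.75 or more, while A grows by at most about a third.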



10. Directions for Future Research 

The major source of error in the distributed RC timing model is the lack of 
consideration for slowly rising inputs. Unfortunately, it has not been possible 
to prove, as it was for the distributed RC model, that delay through the circuit 
is a convex function of the transistor sizes when the input waveform shape is 
taken into account. Although there are several static timing analyzers and a 
transistor sizer [9] that take into account input waveform shape, we hesitate to 
do so without a convexity proof in hand. If a more accurate model turns out 
to be non-convex, there is always the danger that the optimizer might become
trapped in a local minimum that is not a global minimum, resulting in a worse
solution than the less accurate model would give.

Abstract

SPECS2 is a prototype implementation of a new timing simulation and modeling methodology. 
SPECS2 (Simulation Program for Electronic Circuits and Systems 2) is a tree/link based, event- 
driven, timing simulator. A modeling technique, which is predicated on the conservation of 
charge and energy, is employed to produce table models for device evaluation. The tables may be 
constructed to model devices at any desired level of detail. Thus, SPECS2 is a variable accuracy 
simulator. Grossly differing accuracy requirements may be specified for different runs and also 
mixed over different parts of the same circuit. SPECS2 implements a novel oscillation detection 
and suppression scheme that prevents algorithmic oscillation, while leaving real circuit results 
undistorted. SPECS2 takes advantage of the tree/link formulation of the circuit equations to 
provide a formal and general approach to timing simulation. It encounters no special problems 
with floating capacitors or transmission gates. Further, SPECS2 provides the framework for a 
generalized macromodeling and simulation capability. 



1. Introduction 

Timing simulators [1, 2, 3, 4, 5, 6] seek to sacrifice accuracy for efficiency in the 
simulation of large digital integrated circuits. The following observations can be 
made about existing timing and circuit simulators. 

■ The cost-accuracy tradeoff is not continuous and often not in complete 
control of the user. Most simulators are either quick and dirty or excruci- 
atingly slow and very accurate. There is no simulator that accommodates 
both simple models for first cut simulation as well as more complex mod- 
els for further refinement. 

■ To guarantee reliability (convergence), most simulators require unneces- 
sarily complicated models, often resulting in poor efficiency. Simplistic 
models often are not allowed. There is usually no flexibility in the varia- 
tion of modeling complexity. 

■ Many modern simulators are specialized to handle only MOS circuitry
and lack the capability to handle circuits of a more general nature.

With a view to overcoming some of the above limitations, a prototype tim- 
ing simulator called SPECS2 (Simulation Program for Electronic Circuits and 
Systems 2) has been developed. It is an event-driven, tree/link based, timing 
simulator that uses table models for device evaluation. The accuracy of the 
modeling is user-specified and may be varied over different parts of the same 
circuit and over multiple runs. SPECS2 handles a wide range of models from
very simple to very complex. The simulation strategy is not tied to the nature of 
MOS digital circuitry and may be used for bipolar and general analog circuitry 
as well. SPECS2 uses a novel oscillation prevention method for circumventing 
spurious algorithmic oscillation. Variable accuracy modeling is allowed and is 
based on the conservation of charge and energy. 

2. Overview of modeling and simulation 
2.1 Modeling 

Most simulators tacitly assume that the polynomial order of the variation of 
currents and voltages between computed time points is the same. However, for 
a linear capacitor, the variation of the current must be one polynomial order 
lower than that of the voltage in order to conserve charge. Bearing this fact in 
mind, the modeling in SPECS2 is based on the following assumptions. 

1 All branch voltages in the circuit are approximated to be piecewise linear 
in time. All branch currents are piecewise constant in time. 

2 Kirchhoff’s current and voltage laws are satisfied at all times, at the ex- 
tremities and in the interior of all piecewise segments. 

3 The single table (constant) current I of a two-terminal device represented
by i = f(v) in the voltage range $v_1$ to $v_2$ is given by the formula

    [equation not recoverable from the source]    (1)

The following conclusions may be drawn from the above assumptions.

1 Since all branch voltages are piecewise linear and branch currents
piecewise constant in time, charge is always conserved in every linear capacitor.

2 Since all currents and all voltages in the circuit are of the same polynomial
order of variation, respectively, and since KCL and KVL are guaranteed
to hold always, Tellegen’s theorem guarantees the conservation of energy.
These two assumptions are the same as those made in [7].

Figure 1. i-v characteristics are represented by a stair step function.



3 For a linear resistor, the branch constitutive relation, Ohm’s Law in this 
case, is violated by the above assumptions. Hence, the device character- 
istics are forced to be represented by a stair step function in the i-v plane, 
as shown in Fig. 1. This approximation leads to a timing error incurred in 
the process of simulation. However, if the resistor is modeled using the 
formula of (1), the timing error is only second order [8, 9]. 

2.2 Simulation 

The timing simulation in SPECS2 is based on a tree/link formulation of the 
circuit equations. The tree consists only of capacitors and independent volt- 
age sources. The remaining branches form the links and are represented by 
stepwise models in the i-v plane (multi-terminal elements are represented by n- 
dimensional stepwise table models). The existence of such a tree is assumed. 
However, this is not a limiting assumption. 

An “event” is said to occur whenever a (possibly nonlinear) element reaches
a corner of its stair step model. Then, a new branch current is found for the
element by look-up in its associated table. The current distributes over tree
branches as dictated by KCL. Constant capacitor branch voltage slopes are
summed algebraically around fundamental loops to predict the next event times
of the links in their loops. The tree/link partition determines a priori the
topological extent of the propagation of an event. Any two links that share a capacitor
in their fundamental loops are said to be in each other’s “sphere of influence.”
Whenever one of them has an event, the other is affected. The effect manifests
itself as the rescheduling of events associated with all the links in the “sphere
of influence.” The entire simulation algorithm consists of this event processing
until the simulation interval elapses.
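
As a minimal sketch of this event processing, consider one capacitor discharging through one resistor whose i-v curve is a staircase. Each event is the capacitor voltage reaching a corner of the staircase; between events the current is constant and the voltage is linear in time. The table entry used here (the resistor current at the midpoint of each segment) is an assumption made for the example and is not claimed to be formula (1); all names and values are illustrative.

```python
import math


def simulate_discharge(v0, v_stop, r, c, dv):
    """Return (time, voltage) events for a capacitor discharging through a
    stair-step resistor model with voltage segments of width dv."""
    n_segments = round((v0 - v_stop) / dv)
    t, events = 0.0, [(0.0, v0)]
    for k in range(n_segments):
        v_top = v0 - k * dv
        i_table = (v_top - dv / 2) / r   # assumed table value for this segment
        t += c * dv / i_table            # constant current => linear voltage ramp
        events.append((t, v_top - dv))
    return events


events = simulate_discharge(v0=5.0, v_stop=1.0, r=1e3, c=1e-9, dv=0.1)
t_end = events[-1][0]
t_exact = 1e3 * 1e-9 * math.log(5.0 / 1.0)   # analytic RC discharge time
print(len(events) - 1, "events; relative timing error:",
      abs(t_end - t_exact) / t_exact)
```

Only 40 events are processed for the whole transient, and the simulated crossing time stays well within 0.1% of the analytic value for this segment width.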

3. Oscillation prevention 

Discretization in device models causes the device current to be approximated 
by a value that is almost always a little too high or too low. Near equilibrium, 
the computed value hovers back and forth around the correct steady state value. 
The result is a limit cycle, causing spurious algorithmic oscillation. While the 
irregular waveforms produced may be acceptable, the loss in efficiency due to 
the repeated and unnecessary scheduling and processing of events is not. Many 
timing simulators, such as SPECS [4], MOTIS [3] and E-LOGIC [5] face a simi- 
lar problem. SPECS2 uses a novel and effective method for preventing spurious 
oscillation. The advantages of this method are that no a priori knowledge of the 
steady state is required (unlike in [3]), real oscillation is not suppressed (which 
is an obvious disadvantage of a cycle detector) and convergence is guaranteed 
irrespective of the model coarseness. The oscillation prevention scheme never 
allows a device to cross its estimated steady-state (or “pseudo steady-state”) 
value. Whenever possible, a device is pushed into a “pseudo steady-state” and 
it leaves the event queue. Subsequent scheduling of topologically related events 
could, of course, cause this element to reenter the active event queue. 

This oscillation prevention algorithm has been generalized to multi-terminal 
elements. Thus, a formal and general scheme has been developed to tackle the 
problem of spurious oscillation. This scheme is one of the important features of 
SPECS2. 



4. Computational efficiency 

Many means of improving computational efficiency have been implemented in 
SPECS2. 

■ Sparsity: The cutset and loop linkages are stored as sparse matrices to 
exploit spatial sparsity. 

■ Fast incremental KCL: When an event occurs, the changes in the cur- 
rents of the corresponding tree branches may be computed without repeat- 
ing KCL through the cutset. 

■ Fast incremental KVL: The new dv/dt around a loop when an event
occurs may be found in terms of the old dv/dt, the change in the link current
and the loop matrix entry, rather than repeating KVL on a
member-by-member basis.

■ The event heap: The inserting, scheduling, rescheduling and deleting of
events could be a high overhead rider on the SPECS2 algorithm. To
decrease this overhead, the events are stored in a “priority queue” or “heap”
[10], that makes the worst case operation time O(log_2 n) in the number of
events, with the complexity typically being more favorable.

Figure 2. “Twin-T” RC circuit.
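
A minimal sketch of such an event queue, using Python's standard `heapq` module, is given below. The text does not describe how the SPECS2 heap is implemented; the class, its method names, and the lazy invalidation of rescheduled events are assumptions chosen so that each push or pop costs O(log n) in the heap size.

```python
import heapq
import itertools


class EventQueue:
    """Priority queue of (time, link) events supporting reschedule and delete."""

    def __init__(self):
        self._heap = []
        self._latest = {}                  # link -> id of its currently valid entry
        self._ids = itertools.count()

    def schedule(self, link, time):
        """Insert an event, superseding any earlier event for the same link."""
        entry_id = next(self._ids)
        self._latest[link] = entry_id
        heapq.heappush(self._heap, (time, entry_id, link))

    def cancel(self, link):
        """Delete the pending event for a link (it is skipped when popped)."""
        self._latest.pop(link, None)

    def pop_next(self):
        """Return the earliest valid (time, link), or None if nothing is pending."""
        while self._heap:
            time, entry_id, link = heapq.heappop(self._heap)
            if self._latest.get(link) == entry_id:
                del self._latest[link]
                return time, link
        return None


queue = EventQueue()
queue.schedule("R1", 3.0)
queue.schedule("M2", 1.5)
queue.schedule("R1", 0.7)      # reschedule R1 to an earlier time
print(queue.pop_next())        # (0.7, 'R1')
print(queue.pop_next())        # (1.5, 'M2')
print(queue.pop_next())        # None: the stale R1 entry is discarded
```

Stale entries are left in the heap and discarded when they surface, which avoids searching the heap on every reschedule.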

5. Results of simulation 

SPECS2 runs on a VAX 11/780 or station II running 4.3 BSD UNIX. It
has been run on numerous circuits containing resistors, diodes and MOSFETs.
The results of the simulation of some circuits are presented in this section. To 
demonstrate accuracy, the “Twin-T” RC circuit of Fig. 2 (with floating capacitors)
has been selected. The results of simulation with 100 mV segments in
the resistor models are shown in Fig. 3. SPICE waveforms are in dotted lines 
for comparison. Fig. 4 shows the delay of a signal through a set of 6 inverters, 
with SPICE for comparison on the same circuit. Fig. 5 shows an XNOR gate 
and the results of simulation are shown in Fig. 6. For these small circuits with 
5 to 50 segments on each of the device model axes, SPECS2 is about an order 
of magnitude faster than SPICE. When very accurate models with more than 75 
segments on each axis are employed, the run times tend to become comparable 
to those of SPICE. SPECS2 is still experimental and has not been optimized for 
speed. 

Figure 3. Simulation of “Twin-T” RC circuit.

Figure 4. Delay of signal vin through 6 inverters.

Figure 6. Simulation of XNOR gate.

6. Conclusions

A new timing simulator has been presented that is event-driven, uses table mod- 
els and is based on tree/link circuit equation formulation. Models are built based 
on the conservation of charge and energy. SPECS2 is a variable accuracy simu- 
lator. A novel oscillation prevention method has been implemented that weeds 
out only spurious oscillation and is guaranteed to provide convergence. A circuit 
theoretic basis is used in the event processing and oscillation prevention algo- 
rithms. Simulation results, even with very coarse device models, are encourag- 
ing. Some of the ongoing and future projects involving SPECS2 are described 
in the next paragraph. 

The incorporation of nonlinear capacitors and a larger repertoire of devices 
such as bipolar transistors is under investigation. Efficient handling of systems 
with large variations in time constants must be worked out. “Illegal” loops and 
cutsets must be handled. It is possible to have a hierarchy of device models 
at different accuracy levels. An adaptive error control scheme would dynami- 
cally decide which model would be used, depending on how close the concerned 
link was to its steady state. A generalized macromodeling capability with user- 
defined models is another future goal of the SPECS2 simulator. 

Abstract

An automatic synthesis tool for CMOS op amps (OPASYN) has been developed. The program 
starts from one of a number of op amp circuits and proceeds to optimize various device sizes and 
bias currents to meet a given set of design specifications. Because it uses analytic circuit models 
in its inner optimization loop, it can search efficiently through a large part of the possible solution 
space. The program has a SPICE interface that automatically performs circuit simulations for the 
candidate solutions to verify the results of the synthesis and optimization procedure. The simula- 
tion results are also used to fine-tune the analytic circuit descriptions in the database. OPASYN 
has been implemented in Franz Lisp and demonstrated for three different basic circuits with a 
conventional 3 µm process and a more advanced 1.5 µm process. Experiments have shown that
OPASYN quickly produces practical designs which will meet reasonable design objectives. 



1. Introduction 

In recent years, rapid advances have been made in automating the design of 
analog and mixed analog/digital circuits [1, 2, 3, 4, 5, 6]. The general approach 
is to use building blocks which may be stored in the form of parameterized 
generators or as entries in macrocell or standard cell libraries. While the us- 
age of libraries of predefined building blocks can shorten the design period, it 
cannot give an optimal design for every application. Furthermore, the library en- 
tries become obsolete each time the technology or the design rules are modified. 
Generators that are operating at the circuit or symbolic level are more flexible 
and can be useful over a much larger domain. 

Among various building blocks used in analog systems, an operational am- 
plifier (op amp) is one of the most widely used circuit components. An efficient 
design of optimal op amps is thus a corner-stone of a design environment for 
many applications. Several methods concerning the automated design of op 
amps have been published [1, 6, 7, 8, 9]. An optimization-based approach is 
based on algorithmic optimization and circuit analysis techniques. It is appli- 
cable to a broad range of analog circuits and produces near optimal solutions. 
However, this approach is costly in CPU time and has difficulty in properly
tuning the system according to the designers’ needs. Alternatively, an expert system
can be used to store human designers’ knowledge. But the large search space
resulting from the many degrees of freedom in the design of op amp circuits
makes this approach difficult and inefficient.

This paper introduces a practical intermediate approach to automating op amp 
synthesis. It is based on analytic circuit models of the op amp circuit topologies 
covered by the synthesis system. It follows the general approach often taken 
by human designers, but automates the tedious and computationally intensive
aspects of the design process. Based on the applications, a suitable circuit topol- 
ogy is selected, and its device sizes and bias currents are then adjusted in an 
iterative manner until the circuit optimizes some selected targets. When a rea- 
sonable design configuration has been found, it is subject to extensive circuit 
simulation to verify that indeed all the design specifications are met. This whole 
process has been automated. 




Figure 1. Organization of the OPASYN system.

As shown in Fig. 1 our op amp synthesis system (OPASYN) consists of a 
topology database, an optimization module, an interface to the circuit simulator, 
SPICE, and a parameter update module; the latter three are circuit and tech- 
nology independent. The topology database stores analytic circuit performance 
models for each of the adopted circuit topologies so that the design parameters 
can be calculated without the aid of a circuit simulator. The optimization module 
improves the optimality of the design with an algorithmic search procedure. The 
SPICE interface automatically simulates the chosen circuit and determines its 
exact performance parameters, thus verifying the synthesis results. The SPICE 
simulation summary is also used by the parameter update module to improve the
analytic circuit models in the database. A modular system configuration makes
it easy to update OPASYN’s database as device technology changes or better
circuit models become available.

2. Circuit Synthesis Based on Analytic Circuit 
Models 

At the heart of OPASYN is an efficient circuit optimization module. The op- 
timization process relies on analytic circuit models which contain analog circuit 
designers’ expert knowledge about each op amp circuit topology and the de- 
pendencies between device sizes, bias currents, and performance characteristics 
of the overall circuit. Thus the inner loop of the optimization process can run 
much more efficiently using the explicit dependencies in these models rather 
than performing a circuit simulation after each optimization step. In addition, 
this analytic modeling of a given op amp circuit topology produces a simple and 
smooth solution space so that the optimization procedure can use faster - but 
less robust - search procedures. In OPASYN, a simple steepest-gradient descent 
method has been used to explore large portions of the solution space within a 
short period of time. 

Each of the analytic circuit models contains a netlist for the corresponding 
circuit topology, a declaration of independent design parameters for the cir- 
cuit, and a reasonable range of values for these parameters. On top of these 
basic properties, the model stores analytic expressions to compute circuit per- 
formances. These expressions were derived by using first-order circuit analysis 
techniques and topology-specific approximations [13]-[17]. For most dc char- 
acteristics these computed approximations are excellent. For highly non-linear 
specifications such as gain, phase margin, and settling time of the circuit, fitting 
parameters have been introduced to obtain more accurate predictions of specific 
performance characteristics. One example of such analytic expressions is 



$$a_v = c_{gain}\,\frac{g_{m2}\,g_{m6}}{(g_{o2} + g_{o4})(g_{o6} + g_{o7})}$$

where $a_v$ is the small signal dc gain of the circuit topology in Fig. 2, $g_m$ is a
transconductance, $g_o$ is an output conductance of a transistor, and $c_{gain}$ is a
fitting parameter. All the conductances are dependent on transistor sizes and
bias currents. The fitting parameters are being updated as the system acquires 
more information from repeated synthesis and verification steps. 
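
In code, an expression of this kind is a product of stage gains scaled by the fitting parameter. The sketch below is generic rather than a transcription of OPASYN: each stage is described by its transconductance and the total output conductance at its output node, the numbers are placeholders, and the refit rule (scale the fitting parameter by the ratio of simulated to predicted gain) is an assumption, since the text does not say how the update is computed.

```python
def predicted_gain(gm_stage1, go_node1, gm_stage2, go_node2, c_gain=1.0):
    """Two-stage small-signal dc gain: c_gain * (gm1 / go1) * (gm2 / go2)."""
    return c_gain * (gm_stage1 / go_node1) * (gm_stage2 / go_node2)


def refit_c_gain(c_gain, predicted, simulated):
    """Assumed update rule: make the model reproduce the simulated gain."""
    return c_gain * simulated / predicted


# Placeholder small-signal values in siemens.
gain = predicted_gain(gm_stage1=1e-4, go_node1=2e-6, gm_stage2=5e-4, go_node2=1e-5)
print(gain)                                   # about 50 * 50 = 2500
c_new = refit_c_gain(1.0, gain, simulated=2000.0)
print(predicted_gain(1e-4, 2e-6, 5e-4, 1e-5, c_gain=c_new))  # about 2000
```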

Based on these equations and the user-defined design targets, a cost function
is computed which represents a relative figure-of-merit for any particular
combination of design parameter values. In the present version, the cost function
(C) is formed as follows:

$$C = \sum_{i=1}^{n} \left[ \exp\!\left( \pm \frac{w_i\,(spec_i - p_i)}{spec_i} \right) - 1 \right]$$

where n is the number of circuit performance parameters considered in the
program, $p_i$ is the i-th performance parameter, $w_i$ is a relative design priority of
$p_i$ and $spec_i$ is the corresponding design specification. The plus sign is chosen
when meeting the specification requires the value of $p_i$ to be greater than that
of $spec_i$. The cost function produces a smooth search space where the gradient
descent method works effectively. The exponential nature of the cost function
prevents the penalty for violating any specification from being compensated by
overly satisfying other specifications.
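
The behavior of this cost function is easy to check numerically. The sketch below implements the sum as described, assuming positive specification values; the function name, argument layout, and the example numbers are illustrative only.

```python
import math


def cost(performances, specs, weights, higher_is_better):
    """Sum of exp(+/- w_i * (spec_i - p_i) / spec_i) - 1 over all parameters."""
    total = 0.0
    for p, spec, w, higher in zip(performances, specs, weights, higher_is_better):
        sign = 1.0 if higher else -1.0      # plus sign when p must exceed spec
        total += math.exp(sign * w * (spec - p) / spec) - 1.0
    return total


# Hypothetical targets: gain >= 1000 (higher is better), settling <= 100 ns.
specs, weights, direction = [1000.0, 100.0], [1.0, 1.0], [True, False]
print(cost([1000.0, 100.0], specs, weights, direction))  # 0.0: both met exactly
print(cost([2000.0, 160.0], specs, weights, direction))  # > 0: surplus gain does
                                                         # not offset slow settling
```

Each term is bounded below by -1 however far a specification is exceeded, while a violated specification contributes a penalty that grows exponentially, which is the asymmetry the text describes.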

The search for an optimal solution starts with a coarse-grid sampling through 
the entire parameter space fine enough to yield a starting point in each cost 
function well. From the more promising sampling points found in this survey, 
a steepest-gradient descent algorithm searches for the optimal solution in this 
neighborhood. OPASYN will often return several solutions if many different 
locally optimal solutions are found that come close enough to the stated design 
targets. 
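
The two-phase search can be sketched as follows. Everything here is illustrative: the toy cost surface with two wells stands in for the OPASYN cost function, the gradient is taken by central differences, and the grid density, step size, and iteration count are arbitrary choices, not values from the paper.

```python
import itertools


def toy_cost(x, y):
    """Stand-in cost surface with two wells, at (-1, 0) and (1, 0)."""
    return (x * x - 1.0) ** 2 + y * y


def gradient(f, point, h=1e-6):
    """Central-difference gradient of f at point."""
    grad = []
    for k in range(len(point)):
        up, down = list(point), list(point)
        up[k] += h
        down[k] -= h
        grad.append((f(*up) - f(*down)) / (2 * h))
    return grad


def descend(f, start, step=0.05, iterations=500):
    """Fixed-step steepest descent from one starting point."""
    point = list(start)
    for _ in range(iterations):
        point = [p - step * g for p, g in zip(point, gradient(f, point))]
    return point


def search(f, axis_values, n_starts=4):
    """Coarse grid survey, then descent from the most promising samples."""
    grid = sorted(itertools.product(axis_values, repeat=2), key=lambda p: f(*p))
    minima = {tuple(round(v, 3) for v in descend(f, start))
              for start in grid[:n_starts]}
    return sorted(minima)


axis = [-2.0, -2.0 / 3.0, 2.0 / 3.0, 2.0]
print(search(toy_cost, axis))   # two distinct local optima, near (-1, 0) and (1, 0)
```

Returning the set of distinct end points mirrors the behavior described above, where several locally optimal solutions may be reported.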



Figure 2. Basic Two Stage OP Amp.

The described optimization algorithm is independent of the specific device
technology or circuit topology used. However, when a new circuit topology
needs to be introduced into OPASYN, the corresponding analytic circuit models
must first be created. This is a non-trivial task that demands the attention of a
good analog circuit designer.

Figure 3. Single Stage Folded Cascode OP Amp.

Figure 4. Two Stage OP Amp with Cascoded First Stage.

In addition to creating the proper expressions that describe the functional 
dependencies in a new circuit model, reasonable values for the various fitting 
parameters must be determined; this requires a substantial number of simulation 
runs. Any discrepancy between the predictions and the results of the circuit 
simulations can be used to fine-tune these fitting parameters and to improve the 
accuracy of the predictions. Thus the more the system is used, the more accurate 
it gets because of the ‘experience’ gained from previous design tasks. 






3. Automatic Verification of Synthesis Results 

After the device parameters have been determined in the optimization phase, 
the resulting circuit(s) need to be carefully checked against the given design
specifications. This is done with a series of SPICE simulation runs to accu- 
rately determine the various performance characteristics. At this stage it is also 
possible to study the influence of processing variations and device tolerances. 

OPASYN’s SPICE interface module generates the needed SPICE input files, 
makes the necessary system calls to execute the various simulations, interprets 
the SPICE output files to determine the various performance characteristics, and 
compares the latter against the given design targets. OPASYN then supplies a 
summary of these simulation results to the user. It also utilizes these results to 
improve the fitting parameters in the analytic models in the database. 

One of the difficulties with SPICE simulation of analog MOS circuits is to 
achieve dc convergence. Designers often spend a considerable amount of time 
trying to find the right initial conditions and control parameters to be able to 
execute a particular simulation successfully. To alleviate this problem and to 
automate the verification process, the simulation runs are started with suitable 
initial node voltages. These initial node voltages are obtained from a SPICE 
simulation of a slightly modified circuit topology which has much better dc 
convergence properties. We have also investigated the effects of various control 
parameters in SPICE on dc convergence and used the most promising combi- 
nation of these parameter values. This method was successful in most of the 
design examples we have tried. In case OPASYN fails to successfully complete 
the SPICE verification, it generates the necessary SPICE input files and lets the
user complete the verification work and run the parameter update module (see 
Fig. 1). 
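
One common way to pass initial node voltages to SPICE is the `.NODESET` control line. The text does not say which mechanism OPASYN uses, so the helper below is only a sketch of that option; the function name, node names, and voltages are hypothetical.

```python
def nodeset_line(initial_voltages):
    """Format a SPICE .NODESET line from a {node_name: volts} mapping."""
    pairs = " ".join(f"V({node})={volts:g}"
                     for node, volts in initial_voltages.items())
    return f".NODESET {pairs}"


# Hypothetical operating point taken from an easier-to-converge circuit.
guess = {"out1": 1.2, "out": 0.0, "bias": -1.8}
print(nodeset_line(guess))   # .NODESET V(out1)=1.2 V(out)=0 V(bias)=-1.8
```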



4. Implementation and Results 

All of the OPASYN modules shown in Fig. 1 have been implemented in Franz 
Lisp and are running under an Ultrix X2.0-3A system on a VAX 8800. The 
topology database contains the analytic circuit models for three of the most fre- 
quently used op amp circuit topologies [14, 15] shown in Figures 2-4. It also 
contains expert analog designers’ conventional design rules in the form of vari- 
ous relations that must hold between certain circuit parameters. 

The OPASYN technology database currently contains the SPICE device pa- 
rameters for a conventional MOSIS 3 jum process and the more advanced GE 
1.5 ;un process. As demonstrated with Tables 1 to 3, OPASYN has found good 
design configurations for the three examples shown where a wide range of user 
specifications and optimization priorities are applied. It can also be seen that the 
predictions from the analytic circuit models are in rather good agreement with 
the SPICE simulation results. For dc characteristics, excellent agreement is gen-
erally achieved. For ac and transient characteristics such as phase margin, gain,
and settling time, there are some differences, but all these deviations are less 
than 20%. If the user’s design objectives are too demanding to be met, the pro-
gram provides the user with the best result it found. In the example shown in Table
2, the designer specified a very fast settling time of 100 nsec. The optimization
procedure tried to comply as closely as possible and ended up at 160 nsec.
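
The 20% bound can be checked directly against the tabulated values. The sketch below copies the gain, phase margin, and settling time entries from Tables 1 to 3 and computes the deviation of each analytic estimate relative to the SPICE result. Taking the SPICE value as the reference is an assumption made here; the text does not state how the percentage was defined.

```python
# (synthesis estimate, SPICE result) pairs copied from Tables 1 to 3.
RESULTS = {
    "Table 1": {"gain": (23610, 24250), "phase margin": (57.1, 56.1),
                "settling time": (453, 550)},
    "Table 2": {"gain": (1421, 1496), "phase margin": (38.8, 33.0),
                "settling time": (164, 160)},
    "Table 3": {"gain": (15140, 13890), "phase margin": (65.3, 62.1),
                "settling time": (387, 420)},
}


def relative_deviation(estimate, reference):
    """Deviation of the analytic estimate as a fraction of the SPICE value."""
    return abs(estimate - reference) / abs(reference)


def worst_case(results):
    """Return (deviation, table, parameter) for the largest deviation."""
    return max(
        (relative_deviation(est, ref), table, name)
        for table, rows in results.items()
        for name, (est, ref) in rows.items()
    )


dev, table, name = worst_case(RESULTS)
print(f"largest deviation: {dev:.1%} ({name}, {table})")
```

Under this definition the largest deviation among these nine entries is the settling time of Table 1, at a little under 18%.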

CPU time for the synthesis phase varies with the difficulty of the user speci- 
fications and the degree of optimality to be achieved. The range of CPU times 
observed for the synthesis phase of the interpreted version of OPASYN is from 
70 to 280 seconds on a VAX 8800. The SPICE verification phase typically re- 
quires about 200 seconds. 

5. Conclusion 

An efficient CMOS op amp synthesis tool (OPASYN) has been developed. 
It uses analytic circuit models to estimate circuit performance during the search 
for the optimal solution. Our experiments have shown that this approach greatly 
reduces the required CPU time while still producing near-optimal results. 

OPASYN has been applied successfully to three of the most commonly used 
op amp circuit topologies using two different process technologies. The synthe- 
sis process has been reliable and produced good results. 

OPASYN can be easily used by engineers inexperienced in op amp design. 
It does not normally require any user intervention in the design phase. Users 
simply specify their design requirements and optimization priorities and select 
the most desirable result out of several options produced by OPASYN. 

Abbreviations used:

Vdd            = positive power supply voltage
Vss            = negative power supply voltage
CL             = load capacitor
Cc             = compensation capacitor
ωu             = unity gain bandwidth
PSRR-@dc       = Vss rejection ratio at dc
PSRR-@1kilo    = Vss rejection ratio at 1 kHz
TS             = settling time (1 V step, 0.1% interval)
Vo,max         = maximum output voltage
Vo,min         = minimum output voltage
Vic,max        = maximum common mode input voltage
Vic,min        = minimum common mode input voltage
1/f noise      = input equivalent 1/f noise at 1 kHz

parameter            specification    synthesis      SPICE
Vdd                  2.5 V            2.5 V          2.5 V
Vss                  -2.5 V           -2.5 V         -2.5 V
CL                   10 pF            10 pF          10 pF
gain                 10,000           23,610         24,250
power dissipation    1 mW             0.67 mW        0.66 mW
phase margin         60 deg           57.1 deg       56.1 deg
ωu                   4 MHz            4.7 MHz        4.7 MHz
gain margin          none             none           15.1 dB
CMRR                 80 dB            none           92 dB
slew rate            2.5 V/µsec       3.9 V/µsec     3.4 V/µsec
PSRR-@dc             70 dB            94.8 dB        94.8 dB
PSRR-@1kilo          40 dB            58.0 dB        74.3 dB
TS                   500 nsec         453 nsec       550 nsec
systematic offset    none             none           0.008 mV
Vo,max               1.5 V            2.38 V         2.40 V
Vo,min               -1.5 V           -2.33 V        -2.39 V
Vic,max              1 V              1.45 V         1.40 V
Vic,min              1 V              -2.33 V        -2.50 V
1/f noise            1E-6 V/√Hz       2.2E-7         none
total gate area**                     35.3 mil²      none
Cc                   none             4.8 pF         4.8 pF

Table 1. Synthesis and verification results for the basic two stage op amp (shown in Fig. 2)
with optimization priority given to total gate area. MOSIS 3 µm process was used. CPU time for
the synthesis was 175 seconds on a VAX 8800.

parameter            specification    synthesis      SPICE
Vdd                  2.5 V            2.5 V          2.5 V
Vss                  -2.5 V           -2.5 V         -2.5 V
CL                   2 pF             2 pF           2 pF
gain                 1,500            1,421          1,496
power dissipation    30 mW            2.85 mW        2.72 mW
phase margin         60 deg           38.8 deg       33.0 deg
ωu                   4 MHz            42.3 MHz       30.0 MHz
gain margin          none             none           15 dB
CMRR                 none             none           132 dB
slew rate            8 V/µsec         22 V/µsec      19 V/µsec
PSRR-@dc             40 dB            122 dB         112 dB
PSRR-@1kilo          10 dB            122 dB         112 dB
TS                   100 nsec         164 nsec       160 nsec
systematic offset    0.1 mV           none           0.01 mV
Vo,max               1.5 V            1.86 V         2.10 V
Vo,min               -1.5 V           -1.88 V        -2.10 V
Vic,max              1 V              0.78 V         1.50 V
Vic,min              1 V              -2.50 V        -2.50 V
1/f noise            1E-6 V/√Hz       1.5E-7         none
total gate area      TOP              75 mil²        none
Cc                   none             none           none

Table 2. Synthesis and verification results for the single stage folded cascode op amp (shown
in Fig. 3) with optimization priority given to settling time. MOSIS 3 µm process was used. CPU
time for the synthesis was 278 seconds on a VAX 8800.

parameter            specification    synthesis      SPICE
Vdd                  2.5 V            2.5 V          2.5 V
Vss                  -2.5 V           -2.5 V         -2.5 V
CL                   5 pF             5 pF           5 pF
gain                 10,000           15,140         13,890
power dissipation    1 mW             0.85 mW        0.86 mW
phase margin         60 deg           65.3 deg       62.1 deg
ωu                   4 MHz            7.2 MHz        7.2 MHz
gain margin          none             none           9 dB
CMRR                 none             none           95 dB
slew rate            2.5 V/µsec       4.7 V/µsec     4.6 V/µsec
PSRR-@dc             70 dB            90 dB          90 dB
PSRR-@1kilo          40 dB            90 dB          90 dB
TS**                 500 nsec         387 nsec       420 nsec
systematic offset    none             none           0.46 mV
Vo,max               1.5 V            2.37 V         2.39 V
Vo,min               -1.5 V           -2.32 V        -2.39 V
Vic,max              1 V              1.45 V         1.50 V
Vic,min              1 V              -0.88 V        -1.00 V
1/f noise            1E-6 V/√Hz       1.7E-7         none
total gate area**    40 mil²          38.7 mil²      none
Cc                   none             3.4 pF         3.4 pF

Table 3. Synthesis and verification results for the two stage op amp with cascoded first stage
(shown in Fig. 4) with optimization priority given to settling time and total gate area. GE 1.5 µm
process was used. CPU time for the synthesis was 82 seconds on a VAX 8800.

Abstract

This paper describes mechanisms needed to meet aggressive performance demands in a hierarchi- 
cally-structured analog circuit synthesis tool. Experiences with adding a high-speed comparator 
design style to the OASYS synthesis tool are discussed. It is argued that design iteration — the 
process of making a heuristic design choice, following it through to possible failure, then diag- 
nosing the failure and modifying the overall plan of attack for the synthesis — is essential to 
meet stringent performance demands. Examples of high-speed comparators automatically syn- 
thesized by OASYS are presented. Designs competitive in quality with manual expert designs, 
e.g., with response time of 6 ns and input drive of 1 mV, can be synthesized in under 5 seconds 
on a workstation.



1. Introduction 

This paper considers how to attack real-world performance issues in auto- 
matic synthesis tools for analog circuits. OASYS is a tool for synthesizing sized 
circuit schematics from process and performance specifications; a companion 
tool, ANAGRAM, transforms these schematics into layouts [1]. OASYS rep-
resents circuits as a hierarchy of alternate design styles, and uses a planning
mechanism to refine specifications down the hierarchy to primitive devices. De- 
tails of the OASYS synthesis framework, in particular this planning mechanism, 
appear in [2]. Examples of functional, fabricated CMOS op amps designed by 
OASYS appear in [3]. The purpose of this paper is to describe how synthe- 
sis in OASYS changed when a high-performance design style was added to the 
system. We focus on the mechanisms needed to meet aggressive performance 
demands in a hierarchical automatic synthesis tool. 

Many previous attempts at full-custom circuit synthesis have two essential 
characteristics: a simplified target domain, and a simplified model of design [4]. 
For example, most recent synthesis systems have used operational amplifiers as 
a trial domain [3, 5, 6, 7]. Unfortunately, many critical design constraints do 
not manifest themselves here: op amps are essentially linear circuits, can be 
analyzed primarily by using small signal models, do not have devices changing 
regions of operation during normal use, etc. Not all design problems have these 
characteristics, and it is precisely the need to identify, model and manage sub-
stantial non-linearities, transients, changes in operating region, etc., that is criti-
cal in high-performance designs. In addition, many previous synthesis attempts
model design as a rather simple process. At one end of the spectrum, tools such 
as IDAC [5] model synthesis as a simple sequence of equations, known a pri- 
ori, whose solution gives a complete design. When this scheme fails, the input 
specifications are tightened, even though it may be possible that an alternate 
plan of attack for the design might succeed. At the other end of the spectrum, 
tools such as OPASYN [7] model design with a complex set of coupled non- 
linear equations solved with aggressive numerical techniques. This approach 
can support complex performance requirements, but seems currently limited to 
those situations where all relevant aspects of analog circuit behavior can be cap- 
tured beforehand in a single set of static equations. Moreover, the CPU time 
required here is substantially greater, effectively precluding exploration of dif- 
ferent designs in the analog design space. In addition, these tools can generate 
only flat, specified circuit topologies, further limiting opportunities to achieve
higher-performance designs by changing topologies as necessary. 

In contrast, expert human designers explore design problems as a design 
evolves, change models or topologies, simplify as necessary, and iterate on their 
designs to converge to workable solutions. This paper argues that the notion of 
iteration, i.e., making a heuristic design choice, following it through to possible 
failure, then diagnosing the failure and modifying the plan of attack, is central 
to synthesis of tightly-specified, high-performance designs. This is the synthe- 
sis style followed by the planning mechanism in OASYS. We use experience 
gained from the addition of a high-performance CMOS comparator design style 
to OASYS to support this argument. The remainder of the paper is organized
as follows. Section 2 describes the key concerns behind the design strategy 
for this comparator, and compares qualitatively some of the design decisions 
required by the new OASYS comparator style against those required by the ear- 
lier OASYS op amp styles. Section 3 presents results, in the form of several 
OASYS-synthesized comparator schematics, and simulation data. Finally, Sec- 
tion 4 presents concluding remarks. 

2. Design Strategy 

High-performance comparators were chosen as a test domain because, like 
op amps, they are widely used as building blocks, but unlike op amps, they have 
substantial non-linearities, transients, and so forth. The particular comparator 
design style we chose is a novel regenerative, latch-based comparator. This de-
sign style, like all OASYS styles, is explicitly hierarchical, and is completed
by alternating selection of a design style with refinement of specifications down 
to the next, more primitive level. Hence, several different device-level topolo-
gies can be reached from this single style. In OASYS, a design style is formally
an interconnection of abstract blocks. Adding a design style to OASYS requires
specifying this topology of blocks, and adding a design plan for refining overall
performance specifications down to specifications for the component sub-blocks
[2]. The design style for this comparator appears in Figure 1.

Figure 1. Comparator Design Style.

The key difficulty for this high-performance comparator style is the construc- 
tion of the design plan. Roughly speaking, there are three requirements for pro- 
ducing such a design plan. First, we need detailed knowledge of which effects 
can be modeled, either analytically or numerically. Second, we need knowledge 
of where heuristic design decisions are required to advance the design (because, 
for example, there is insufficient information in the current design state to know 
with certainty the optimum decision). Finally, we need to identify and manage 
the degrees of freedom in the design problem. These are the explicit performance 
tradeoffs available in a specific design style; choices here are complicated by the 
fact that frequently there is no clear analytical model to prefer one choice over 
another. As design plans are executed to refine specifications, heuristic design 
decisions made earlier are often found later to be imprecise, requiring that earlier 
phases of the design be repeated. OASYS incorporates a rule-based failure di- 
agnosis and re-planning mechanism to identify failures, correct the overall plan 
of attack for the design, and restart the internal design plan at the appropriate 
point. 

Figure 2. Small-Signal Model for the Latch.



Because of space limitations, we omit a complete, analytical description of 
all the design steps in the comparator design plan; these appear in [9]. Instead, 
we focus on four key concerns that must be dealt with in any analog design plan: 

1 Simplifications versus subtle effects 

2 Compensating for parasitics 

3 Managing degrees of freedom 

4 Managing synthesis within a hierarchy 

and illustrate some important design decisions in the comparator design plan 
that cope with these concerns. 

2.1 Simplifications versus Subtle Effects 

Simplifications are essential to suppress unwanted details that obscure gen- 
eral insights about overall circuit behavior. But whereas in low-performance 
designs these suppressed details really can be ignored, in high-performance de- 
signs they must eventually be recovered and considered. For example, it is com-
mon to neglect the channel length modulation (λ) effect while calculating the
transconductance (gm) of a MOS device; however, for large signal operation,
e.g., large supply voltages, channel length modulation is no longer a negligible
effect. With respect to our comparator design problem, an essential simplifi- 
cation involves the time domain response of the latch within the comparator. 
Figure 2 shows a simplified small-signal model for this latch. 

Even though this latch operates primarily as a “large-signal” circuit — be- 
cause it sees large voltage transients — the time domain response is dominated 
by the small-signal behavior. If we assume that the switch resistance (Rs) is
negligible, as is normally done, then a second order characteristic equation de-
scribes the latch, and the time domain response is dominated by a single pole
[10]. In our latch, we assume that the switch resistance Rs is non-zero, and
contrary to common belief, this actually reduces the response time of the latch
because the two critical capacitors, and Cload in Figure 2, are decoupled by
the switch resistances Rs. For the same voltage swing at the latch output node,
the gate voltages need not vary as much as when Rs is assumed zero. Unfortu-
nately, a fourth-order characteristic equation now describes this latch, seemingly
precluding simple analytical models for synthesis. However, in the case that the
circuit is totally symmetrical, it is possible to show [9] that the solution to this
fourth-order system has a non-dominant real pole, a complex-conjugate pair in
the left-half plane (LHP) and a dominant pole in the right-half plane (RHP).
Thus, the time domain response is still dominated by a single pole, but the so- 
lution incorporates the non-zero value for Rs. We employ this simplified model 
to suggest the “correct” avenue of attack when designing the latch, but return to 
the higher-order model to verify our design decisions. In addition, though this 
simplified model assumes that the latch response is dominated by one particular 
state, i.e., the regions of operation of each of the latch’s constituent devices, note 
that we must compensate for the fact that these devices actually change state as 
the latch switches. Thus, we incorporate heuristics to choose the right models 
dynamically, to meet desired design goals. In all cases, more complete models 
are used to verify the evolving circuit design after these heuristic choices. 
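
The consequence of a single dominant RHP pole can be written out as a worked relation. This is the standard first-order regeneration estimate, not the fourth-order model of [9]. The symbols τ (regeneration time constant, the reciprocal of the dominant RHP pole), ΔV_0 (initial imbalance at the latch nodes), and ΔV_out (required output difference) are introduced here for illustration only.

```latex
\Delta V(t) \approx \Delta V_0 \, e^{t/\tau},
\qquad
t_{\mathrm{regen}} \approx \tau \,
  \ln\!\left(\frac{\Delta V_{\mathrm{out}}}{\Delta V_0}\right)
```

For instance, growing a 1 mV imbalance to 1 V takes about ln(1000) ≈ 6.9 time constants, which is why the position of the dominant pole largely sets the response time in this simplified view.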

The process of selecting and validating such simplifications is central to the 
construction of a design plan within OASYS. While “designing” such simpli- 
fications, some effects are ignored not because they are difficult, but because 
they clutter models. These can be handled by switching from simple to complex 
models as needed during synthesis. More important, there are subtle effects that
really only appear when performance limits are pushed hard. A good example 
in our comparator concerns the analog switches within the latch sub-module. 
Charge that is stored in the channel of the switch transistors when they are con- 
ducting is dumped into the local circuit environment of the surrounding latch 
when the switches turn off. If this effect is neglected, we cannot accurately
predict high-speed behavior. A simple model of this environment appears in 
Figure 3. 

Due to the (typical) assumption of minimal channel length for the switch de- 
vices, and finite rise and fall times for the clock, we can safely neglect charge 
injection into the substrate [11]. Due to the symmetrical nature of our latch, we 
can now assume that all the channel charge, accumulated during the switch’s 
conducting period, is equally distributed between the source and the drain ter- 
minals of the transistor Qs in Figure 2 during the turn-off transient. 

Figure 3. Switch Environment.



2.2 Compensating for Parasitics 

Some subtle effects, such as the charge dump from the switches, can be pre- 
dicted adequately. In contrast, device parasitics, though extremely influential 
during synthesis, cannot be calculated until after devices are actually designed. 
Hence, compensating for parasitics in OASYS design plans almost always in- 
volves iteration, i.e., modifying and re-executing a design plan. Parasitics are 
estimated with heuristics, design proceeds, and if subsequent verification proves 
the estimate sufficiently wrong, the design process is reiterated. Specifically, 
whenever a step in an OASYS design plan is able to calculate an actual para- 
sitic, it also compares this value with any previously predicted value. If these 
values are sufficiently different, a heuristic is invoked to decide how many of 
the recent steps in the design should be redone, using this newer, more accurate 
parasitic estimate. In this way, sequences of plan steps can be iterated until the 
predicted and calculated values converge. Usually, if the sequence of design 
activities to be iterated is carefully constructed, predicted and calculated val- 
ues actually do converge. However, sometimes convergence is never reached, 
because the design is simply infeasible; it is impossible to adequately meet all
performance specifications in the current circuit design style, using the existing 
models and heuristics. 
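
The predict, design, compare loop described above can be sketched as a simple fixed-point iteration. This is a minimal illustration, not the OASYS mechanism: `design_step`, the 5% tolerance, the pass limit, and the toy parasitic model are all assumptions, and the real tool uses a heuristic to decide how many plan steps to redo.

```python
def iterate_on_parasitic(design_step, initial_guess, rel_tol=0.05, max_passes=20):
    """Re-run design_step until predicted and calculated parasitics agree.

    design_step(predicted) sizes the devices assuming `predicted` and
    returns the parasitic calculated from the resulting devices.
    Returns (value, passes); raises RuntimeError if the values never converge.
    """
    predicted = initial_guess
    for n in range(1, max_passes + 1):
        actual = design_step(predicted)
        if abs(actual - predicted) <= rel_tol * abs(actual):
            return actual, n
        predicted = actual  # redo the steps with the newer, more accurate value
    raise RuntimeError("parasitic estimate did not converge; design infeasible")


def toy_step(c_par_pf):
    """Hypothetical model: 0.2 pF fixed plus 30% feedback from the estimate."""
    return 0.2 + 0.3 * c_par_pf


value, passes = iterate_on_parasitic(toy_step, initial_guess=0.0)
print(f"converged to {value:.4f} pF after {passes} passes")
```

When each pass only weakly depends on the previous estimate, as in the toy model, the loop settles in a few passes; a step that amplifies the error never settles and corresponds to the infeasible case in the text.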

2.3 Managing Degrees of Freedom 

Another important component in the generation of a design plan for the com- 
parator is identification and management of the degrees of freedom in the de- 
sign. For example, one critical degree of freedom here is the area/power trade- 
off. With respect to our simplified models, constraints exist that limit the ex-
tremal points in the spectrum of feasible (area, power) values, but many values
in between are viable, if not necessarily optimal. Heuristics were developed to 
choose the best “bias” for this tradeoff when design decisions were required, or 
when failures were encountered during synthesis and the plan of attack needed 
modification. In this comparator topology, higher speed is achieved primar-
ily by manipulating bias currents, and to a lesser extent by increasing device
sizes. Indeed, many of the devices tend to be designed small to minimize the 
parasitics they contribute (each of which may slow the comparator’s overall re- 
sponse time). As an example of managing this tradeoff, the design plan for the 
latch sub-module currently employs the simplified first-order model described 
in Sec. 2.1 to predict an approximate family of feasible (area, power) points 
meeting the desired response time. (Here, “area” estimates active device area 
only, i.e., device sizes, not routing area, etc.) The very simple models are at- 
tractive here since they can be used quickly to explore this area versus power 
surface. At this rather coarse level of detail, we bias the tradeoff in favor of 
minimum area, and for a specific area choice, use higher-order models to com- 
pute the required power. Unfortunately, (area, power) choices that look promis- 
ing with coarse models are not always valid when revisited with more accurate 
models. Our strategy here is to first try to increase the power gradually, keeping 
the area constant, until the response time specification is satisfied, or the power 
is deemed unreasonably large; if the power becomes too large, we increase the
area slightly, and again search for a reasonable power value. Note that other 
strategies are feasible here, but this approach appears to work well. 
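
The search strategy just described can be sketched as two nested loops. The step factors, the power ceiling, and the toy response-time model below are hypothetical stand-ins for the higher-order models mentioned in the text; only the control flow follows the description.

```python
def size_latch(response_time, t_spec, area, power, max_power,
               power_step=1.10, area_step=1.05, max_area_bumps=50):
    """Raise power at fixed area until the spec is met; if the power becomes
    unreasonably large, enlarge the area slightly and search again.

    Returns (area, power) or None if no acceptable point is found.
    """
    start_power = power
    for _ in range(max_area_bumps + 1):
        p = start_power
        while p <= max_power:
            if response_time(area, p) <= t_spec:
                return area, p
            p *= power_step  # increase power gradually, area held constant
        area *= area_step    # power too large: increase the area slightly
    return None


def toy_response_time(area, power):
    """Hypothetical model: response time falls as area and power grow."""
    return 20.0 / (area * power) ** 0.5


print(size_latch(toy_response_time, t_spec=6.0, area=4.0, power=1.0, max_power=2.0))
```

Because the outer loop restarts from the smallest power each time the area grows, the search stays biased toward minimum area, matching the stated preference.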

We refer to the process described above as “absorbing” a degree of freedom: 
the tool, and not the designer, determines the tradeoff. A goal for a mature, 
complete design plan is to identify and absorb all relevant degrees of freedom. 
Because the current OASYS comparator design plan is still immature, not all 
degrees of freedom are absorbed; some of these are temporarily promoted to 
the level of specifications for the designer to choose. Currently, a few tradeoffs 
related to the overall gain, and gain partition among the sub-blocks of the com- 
parator style, e.g., the gain of the input differential amplifier, are specified by 
the designer. However, our expectation is that with experimentation, adequate 
heuristics will be determined to manage these decisions internally. Indeed, a 
similar tradeoff, the gain partition between the stages of the OASYS two-stage 
op amp is absorbed internally, and its performance has improved as this op amp 
design plan has matured [9]. It is also interesting to note that other synthe-
sis architectures, notably IDAC [5], routinely avoid all such tradeoff decisions,
forcing the designer to specify manually many of these internal design tradeoffs, 
along with ordinary performance specifications. 

2.4 Managing Synthesis within a Hierarchy 

OASYS was the first analog synthesis tool to make extensive use of hierar- 
chy even for basic cells like op amps and comparators; subsequent tools have 
adopted this approach [12]. This hierarchy helps greatly to simplify the process 
of design for complex objects. Unfortunately, it also complicates attempts to 
push hard on performance limits, because the flow of design information across
boundaries of modules within the hierarchy is restricted and stylized. An ex- 
ample in the comparator occurs within the design plan for the latch. The latch 
needs to know the amount of charge dumped by the analog switches within it, 
but it actually does not know what topology will be used for these switches. In 
the OASYS hierarchy, the switches are a separate block, with their own inde- 
pendent design plan. To handle this, the latch computes the maximum allowable 
charge dump, invokes the design plan for the switches, and analyzes the predic- 
tion for charge dump returned by the switches. If the prediction is not within 
acceptable limits, the separate latch and switch design plans iterate to reach a 
mutually agreeable value. However, if it becomes apparent that the switches 
cannot possibly meet their specification, the latch design plan itself is deter- 
mined to have failed, and the overall comparator design plan will be modified 
and reinvoked to try to avoid this problem. 
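
The negotiation between the latch and switch plans can be sketched as follows. This is a loose illustration under stated assumptions: the style names, the charge values, and the idea that the latch offers a short list of successively relaxed charge budgets are all hypothetical. The sketch only shows how a sub-block returns a prediction, how the parent checks it, and how failure propagates upward.

```python
class PlanFailure(Exception):
    """Raised when a block's design plan cannot meet its specification."""


def design_switch(target_fc, styles):
    """Switch plan: the simplest style that meets the target charge dump,
    otherwise the best prediction it can offer."""
    for name, charge_fc in styles:  # ordered simplest first
        if charge_fc <= target_fc:
            return name, charge_fc
    return min(styles, key=lambda s: s[1])


def design_latch(tolerable_budgets_fc, styles):
    """Latch plan: offer charge budgets, tightest first, to the switch plan."""
    for budget in tolerable_budgets_fc:
        name, predicted = design_switch(budget, styles)
        if predicted <= budget:
            return {"budget_fc": budget, "switch_style": name,
                    "charge_fc": predicted}
    raise PlanFailure("latch plan failed; comparator plan must be revised")


# Hypothetical switch styles with predicted charge dump in fC.
STYLES = [("single transistor", 12.0), ("dummy-compensated", 4.0)]
print(design_latch([2.0, 5.0], STYLES))
```

The latch never inspects the switch topology itself; it sees only the returned prediction, which mirrors the restricted flow of information across module boundaries.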

Superficially, it may seem that the need to negotiate among sub-blocks and 
iterate toward convergence that is imposed by a strict hierarchy is more a burden 
than an advantage. However, we regard this as a small price to pay; there are many
striking advantages to this hierarchical formulation [2]. Most notably, design 
knowledge is encapsulated in the form of reusable sub-blocks. There is only 
one current mirror designer in OASYS, despite the fact that several op amp 
and comparator topologies require mirrors, and this mirror designer can design 
in any of several circuit styles. Moreover, design is decomposed into smaller, 
more easily managed tasks. For example, our comparator is not designed at the 
level of transistors, but rather, at the level of latches, differential pairs, etc., all 
specified by the behavior seen at their terminals. Finally, because each sub-block
can be realized in several different design styles, a single high-level design style 
can be refined down to many distinct transistor-level circuit topologies. Thus, 
a small hierarchy of design plans can reach a very large population of different 
flat circuit topologies (avoiding the need for a large library of synthesis tools for 
each flat circuit schematic). 

3. Results 

Prior to this study, OASYS was capable of designing CMOS op amps in 
OTA and generic two-stage Miller-compensated styles. A wide variety of lower- 
level circuit styles, essentially used as building blocks, were also supported:
current mirrors, differential pairs, level shifters, etc. Full implementation of the 
comparator style increased the size of OASYS by about 25%, to approximately 
10,000 lines of Franz Lisp. This required addition of designers (i.e., design 
styles and plans) for the comparator itself, as well as for two new lower-level 
blocks: analog switches and latches. 

Figure 4 shows the specifications for a modest comparator, the OASYS-
synthesized design, and results from a SPICE simulation for verification. This
comparator has a response time of 65 ns, with an input drive of 0.5 mV and es-
timated power consumption of 0.12 mW. Total design time was about 1 second
on a microVAX. All designs were simulated using models from the MOSIS 3
µm bulk CMOS process. Also included here is a plan trace for this synthesis.
Individual plan steps within OASYS have been uniquely numbered, and their
execution plotted as a function of design time. These traces illustrate succinctly
how designs evolve. For example, an “ideal” design plan appears as a mono-
tonically increasing straight line: each plan step is visited exactly once, in the
default order specified by each plan at each level of the hierarchy; no steps are
repeated. However, in our experience, comparators are never so ideal. The plan
trace shown in Figure 4 suggests this was a fairly simple design problem: one pass
through the default design plan, with only two plan corrections (the discontinu-
ities in the trace), sufficed to meet specifications.

Figure 4. Simple Comparator Synthesis Example.
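
Reading a plan trace can be made concrete with a small sketch. It assumes a trace is simply the list of plan step numbers in the order they were executed, and it treats every backward jump as a plan correction. The sample trace is invented for illustration and is not the trace of Figure 4.

```python
def backward_jumps(trace):
    """Return (from_step, to_step) pairs where execution moved backward."""
    return [(a, b) for a, b in zip(trace, trace[1:]) if b <= a]


def is_ideal(trace):
    """An ideal plan visits each step once, in strictly increasing order."""
    return not backward_jumps(trace)


# Hypothetical trace: one pass through the plan with two corrections.
trace = [1, 2, 3, 4, 2, 3, 4, 5, 6, 5, 6, 7]
print(backward_jumps(trace))
print(is_ideal([1, 2, 3, 4, 5]))
```

A trace with many backward jumps clustered over the same range of steps corresponds to the extensive iteration described for the more aggressive designs.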

The situation changes considerably when high-performance designs are at- 
tempted. Figure 5 shows a much more aggressive comparator designed for a 
response time of 5.75 ns and 1 mV input drive. Power consumption is estimated 
to be 1.01 mW. Total design time was about 3 seconds on a microVAX. This 
circuit’s performance compares favorably with the speed, resolution, and power 
tradeoffs of several published manual designs [13, 14, 15, 16, 17]. Notice that 
the topology has changed from that shown in Figure 4. A more complex topol- 
ogy for the analog switches was chosen to compensate for charge dumping. The 
plan trace is also significantly different for this design: there is considerable it-
eration and replanning activity. This synthesis has three different phases. First,
obviously wrong attempts are made and rejected. Second, a good plan of attack 
evolves, but must be iterated extensively; the specific heuristic decisions made 
in individual plan steps are not quite right, and many alternatives are explored 
before the design converges. Finally, a complete retuning with respect to more 
accurate final parasitic estimates is attempted, and the design completes. 

Direct comparison with other analog synthesis systems is difficult, since none
that we are aware of attacks high-speed comparator design explicitly. For ex-
ample, the ultra-low-power comparators designed by IDAC [5] appear to have
response times around 0.5 µs at best. The comparator used in the A/D designer
AIDE2 [18] is essentially just a fixed standard cell, with roughly a 1 µs response
time.



4. Conclusions 

From experience with adding a high-performance comparator style to OASYS, 
we observe that high-performance designs require substantial exploration among 
tradeoffs, i.e., design iteration, to find workable designs. A variety of high- 
performance comparators have been designed by OASYS and successfully sim- 
ulated. These results are significant because these designs can be synthesized in 
under one minute of CPU time, and yet appear to approach performance limits
roughly comparable to manually produced expert designs. 

Current work is focused on extensive tuning of the comparator design heuris-
tics (notably, dealing with unabsorbed degrees of freedom), broadening the com-
parator’s design hierarchy to include more latch and switch sub-block styles, and
accommodation of noise and input offset specifications for the comparator.

Figure 5. Aggressive Comparator Synthesis Example.

Abstract

The program TRANALYZE generates a gate-level representation of an MOS transistor circuit. 
The resulting model contains only four-valued unit and zero delay logic primitives, suitable for 
evaluation by conventional gate-level simulators and hardware simulation accelerators. TRANA- 
LYZE has the same generality and accuracy as switch-level simulation, generating models for a 
wide range of technologies and design styles, while expressing the detailed effects of bidirectional 
transistors, stored charge, and multiple signal strengths. It produces models with size comparable 
to ones generated by hand. 



1. Introduction 

Switch-level simulation has proved an effective means for verifying digital 
circuits implemented in MOS technology. By directly working from a transis- 
tor representation, a switch-level simulator can handle a large range of circuit 
designs and capture such subtle effects as bidirectional signal flow, dynamically
stored charge, and multiple signal sources with different driving impedances. 
Traditionally, switch-level simulation requires evaluation mechanisms that are 
not found in conventional gate-level simulators. To utilize the features and mod- 
eling libraries of existing simulators, and for execution on gate-level hardware 
simulation accelerators, we would like to overcome this incompatibility.

By automatically generating gate-level models from transistor circuits, we 
can provide a simulation methodology that combines switch-level generality 
and accuracy with gate-level compatibility and performance. In taking this ap- 
proach, we should take care to satisfy several design constraints. First, we must 
not give up the generality and accuracy of switch-level simulation. Second, the 
generated models should be suitable for evaluation by any gate-level simulator. 
Third, the generated models should be as compact as possible. 

1.1 Previous Work 

A variety of programs have been developed that fall into the general category 
of gate-level model extractors. On close examination, however, we see that all 
of them fall short in at least one of the design goals listed above. 

Most model extractors operate by repeatedly applying transformation rules 
that eliminate or reduce some of the transistor logic and replace it by logic gates 
[3, 4]. These rules typically include transformations such as: series/parallel re- 
duction of pullup and pulldown networks; recognition of static logic gates; iden- 
tification of complementary control signals; merging transistor pairs in transmis- 
sion gates; elimination of “uninteresting” nodes [1]; transistor direction assign- 
ment [2]; and recognition of special structures such as multiplexors and busses. 
Often, these programs encounter cases where none of the transformation rules 
apply, but transistors remain in the circuit model. The user must then either 
supply hand-generated models or add new transformation rules to the model 
extractor. 

An alternative to rule-based approaches is to generate models using methods 
derived from switch-level simulation. This has the advantage of starting from a 
firm foundation in terms of generality and accuracy. The challenge becomes to 
satisfy the remaining goals of compatibility and conciseness. Symbolic switch- 
level analysis, as exemplified by the ANAMOS program [7] (the preprocessor 
for the COSMOS simulator [5]), can extract a logic description from the transistor 
network. ANAMOS generates models consisting of Boolean DAGs: networks of 
2-valued operations describing the computation for one unit time step 
of the simulator. These DAGs operate on encoded signal values — each 3-valued 
node state is encoded as a pair of binary values. 

In earlier work at CMU, we developed a program HLGCC to convert the 
Boolean DAGs generated by ANAMOS into logic gates [9]. HLGCC reduces 
the model size by mapping the two-valued operations of the Boolean DAGs 
into 3-valued logic primitives, and merging multiple operations into three-input 
logic gates. These optimizations, however, had only limited success. As an 
example, for a CMOS transmission gate multiplexor, ANAMOS generates a 
DAG containing 17 Boolean operations, which HLGCC maps into 10 three- 
input gates. Ideally, we should be able to generate a single gate model for this 
circuit. Several reasons can be identified for this shortcoming. First, by oper- 
ating on encoded signal values, ANAMOS loses much of the original context 
about the node state values. Second, by analyzing each channel-connected com- 
ponent independently, ANAMOS cannot exploit correlations between signals, 
e.g., complementary signal pairs. Finally, ANAMOS generates a very conser- 
vative model, assuming that all transistors have unit delay, and that the effects 
of X signals should be modeled very strictly. Thus, the generated models must 
include many terms for evaluating sneak paths and unit delay glitches. 

1.2 New approach 

As with ANAMOS, TRANALYZE performs a symbolic, switch-level anal- 
ysis to extract a logic representation of the circuit behavior. Instead of using 
a binary encoding of signal values, however, it performs the analysis in an al- 
gebra with values 0, 1, X, and Z. The fourth value Z is similar to the “high 
impedance” value used by many gate-level simulators for simulating tri-state 
drivers and busses. The analysis algorithm has the same general form as that 
used in ANAMOS [7] — each channel-connected component is analyzed by set- 
ting up systems of equations, which are solved by a symbolic form of Gaussian 
elimination [6] to yield the logic representation. TRANALYZE differs from its 
predecessor in the following key respects: it represents and manipulates its logic 
description as a 4-valued gate-level network rather than as a Boolean DAG; it 
analyzes the channel-connected components in rank order, so that it can exploit 
the correlations between the input signals of each component; and the optimizer 
can optionally generate models with less conservative, and hence simpler han- 
dling of X signals. 

During the analysis, TRANALYZE applies extensive optimizations of the 
gate-level model to reduce its size. The combination of symbolic analysis plus 
logic optimization effectively performs many of the circuit optimizations imple- 
mented explicitly by rule-based systems. 

2. Generated gate-level models 

In a switch-level network, each node may have state 0, 1, or X, where X indi- 
cates either an unknown value or a potentially nondigital voltage. To represent 
intermediate terms in the analysis, we introduce a fourth value Z indicating the 
absence of a signal. In the gate-level model produced, we can guarantee that no 
circuit node will ever have state Z — at the very least an isolated node will retain 
its stored charge. 

An MOS transistor can serve in two different capacities. First, as in relay 
contact networks, series-parallel (and other) configurations can be formed that 
conditionally connect two terminals according to some logic function of the 
transistor gate nodes. We will refer to this as “And/Or” logic. Second, a transis- 
tor can propagate a data value from its source to its drain (or vice-versa) under 
the control of its gate node. We will refer to this as “Steering” logic. Typical 
MOS circuits contain both forms of logic. Logic gates are created using And/Or 
logic as pullup and pulldown networks. Transmission gates, multiplexors, and 
carry chains are created using steering logic. Our signal algebra includes operators 
to express both forms of logic. 

2.1 Primitive gates 

We express And/Or logic using gate types AND and INVERT. The OR operation 
is expressed in terms of these two gate types according to DeMorgan’s Laws, 
aiding the detection of identical and complementary terms. We can guarantee 
that the Z state will never arise in And/Or logic. 

We express Steering logic using gate types ENABLE and MERGE. These opera- 
tions are defined as follows, where entry indicates a condition that will never 
arise in the generated gate-level network: 

The ENABLE gate conditionally propagates the data on input a according to the 
control signal on input e. As the table indicates, the control input can never equal 
Z, since its value is generated by And/Or logic. This gate has functionality similar to 
that of a tri-state buffer, except that it treats control signal value X the same as 
a 1. TRANALYZE resolves the effects of unknown control signals on tri-state 
buffers by including additional gates in the model. The MERGE gate combines 
two 4-valued signals. Its functionality is identical to a “wired-logic” function in 
tri-state logic. As a final gate type, the DELAY gate implements a unit delay. All 
other gates have zero delay. 
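
As an illustration, the zero-delay primitives described above can be sketched in Python. This is a minimal sketch and not the TRANALYZE implementation. The ENABLE behavior follows the description in the text (a control value of X is treated like 1). The MERGE function is assumed to be the usual wired resolution of tri-state logic, and AND and INVERT are assumed to follow standard three-valued logic. All function names, including the `mux` helper, are hypothetical.

```python
# Four signal values: logic 0, logic 1, unknown X, and "no signal" Z.
ZERO, ONE, X, Z = "0", "1", "X", "Z"


def invert(a):
    """INVERT over {0, 1, X}; Z never reaches And/Or logic."""
    if a == Z:
        raise ValueError("Z cannot arise in And/Or logic")
    return {ZERO: ONE, ONE: ZERO, X: X}[a]


def and_(a, b):
    """AND over {0, 1, X} (assumed ternary logic): a 0 input forces 0."""
    if Z in (a, b):
        raise ValueError("Z cannot arise in And/Or logic")
    if ZERO in (a, b):
        return ZERO
    if a == ONE and b == ONE:
        return ONE
    return X


def or_(a, b):
    """OR expressed with AND and INVERT by DeMorgan's Laws."""
    return invert(and_(invert(a), invert(b)))


def enable(a, e):
    """ENABLE: pass data a when control e is 1 or X, otherwise output Z."""
    if e == Z:
        raise ValueError("control input can never equal Z")
    return Z if e == ZERO else a


def merge(a, b):
    """MERGE (assumed wired resolution): Z yields to the other input,
    equal values pass through, and any disagreement gives X."""
    if a == Z:
        return b
    if b == Z:
        return a
    return a if a == b else X


def mux(a, b, s):
    """Hypothetical transmission-gate multiplexor: a when s = 1, b when s = 0."""
    return merge(enable(a, s), enable(b, invert(s)))
```

Under these assumptions, `mux(ONE, ZERO, X)` evaluates to X because the two data inputs disagree, while `mux(ONE, ONE, X)` evaluates to 1 because both paths carry the same value.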

2.2 Gate-level logic example 

Figure 1 shows a CMOS circuit and the generated gate network. As can 
be seen, the program successfully recognizes that the pullup and pulldown net- 
works form a NOR gate driving node S. The logic generated for node T selects 
either the output of the NOR gate, or the stored value on node T (delayed by 1 
time unit), depending on control signal C. After technology mapping, TRAN- 
ALYZE generates a 2 gate model, consisting of a NOR and a multiplexor. In 
contrast, HLGCC generates a 7 gate model. 

3. Analysis method 

Following the parsing of a transistor netlist file, the analysis proceeds by a 
series of steps. 

3.1 Partitioning and ordering 

The transistor circuit is first partitioned into “channel-connected components,” 
each consisting of a set of storage nodes connected by the source-drain terminals 
of transistors. Each such region is analyzed separately. 
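
The partitioning step can be illustrated with a union-find pass over the netlist. The sketch below is an assumption about how such a partition might be computed, not a description of the TRANALYZE code. The netlist format (tuples of gate, source, drain), the function name, and the example circuit are all hypothetical.

```python
from collections import defaultdict


def channel_connected_components(transistors, input_nodes):
    """Group storage nodes joined by transistor source-drain channels.

    transistors: iterable of (gate, source, drain) node names.
    input_nodes: nodes such as power, ground, and primary inputs; they
                 are not storage nodes and do not join components.
    """
    parent = {}

    def find(n):
        parent.setdefault(n, n)
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for _gate, source, drain in transistors:
        ends = [n for n in (source, drain) if n not in input_nodes]
        for n in ends:
            find(n)  # register the storage node
        if len(ends) == 2:
            parent[find(ends[0])] = find(ends[1])

    groups = defaultdict(set)
    for n in list(parent):
        groups[find(n)].add(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical netlist: a NOR gate driving S, a pass transistor from S to T,
# and an inverter driven by T with output Q.
netlist = [
    ("A", "Vdd", "E"), ("B", "E", "S"),   # pullup chain through node E
    ("A", "S", "Gnd"), ("B", "S", "Gnd"),  # parallel pulldowns
    ("C", "S", "T"),                       # pass transistor
    ("T", "Vdd", "Q"), ("T", "Q", "Gnd"),  # inverter
]
components = channel_connected_components(netlist, {"Vdd", "Gnd", "A", "B", "C"})
```

Here nodes E, S, and T fall into one component because they are linked through source-drain channels, while Q forms a second component connected to the first only through a transistor gate.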

TRANALYZE exploits the logical dependencies implied by zero delay tran- 
sistor logic in its logic optimization. It does this by analyzing the circuit com- 
ponents in rank order, so that as a component is analyzed, the gate-level models 
for the signals controlling zero delay transistors are available to the optimizer. 
TRANALYZE automatically inserts unit delays on some of the transistors to 
enable a rank ordering and to avoid generating a gate network with zero delay 
cycles. It does this by a simple greedy method during the topological sorting of 
the components. 
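
The text does not specify the greedy rule used to place the unit delays. One simple possibility, sketched below as an assumption, is a depth-first topological sort in which any edge that would close a cycle is given a unit delay. The sketch works at the granularity of whole components rather than individual transistors, and the names `rank_order` and `depends_on` are hypothetical.

```python
def rank_order(components, depends_on):
    """Order components so each follows the components driving its gates.

    components: list of component names.
    depends_on: dict mapping a component to the components whose nodes
                control its transistor gates.
    Returns (order, delayed): the rank order and the set of
    (driver, driven) edges given a unit delay to break cycles.
    """
    order, delayed = [], set()
    state = {}  # absent = unvisited, 1 = on the DFS stack, 2 = finished

    def visit(c):
        state[c] = 1
        for d in depends_on.get(c, ()):
            if state.get(d) == 1:
                # Greedy choice: the edge that closes a cycle gets the delay.
                delayed.add((d, c))
            elif d not in state:
                visit(d)
        state[c] = 2
        order.append(c)  # all zero-delay drivers are already placed

    for c in components:
        if c not in state:
            visit(c)
    return order, delayed


# Hypothetical example: "latch" and "inv" form a feedback loop.
order, delayed = rank_order(
    ["latch", "nor", "inv"],
    {"latch": ["nor", "inv"], "inv": ["latch"], "nor": []},
)
```

In this example the loop is broken by delaying the edge from "latch" to "inv", so every remaining zero-delay driver precedes the component it controls.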

3.2 Symbolic analysis 

During the analysis of a channel-connected component, gates are added to the 
generated network describing the functionality of the component nodes. Each 
component node is temporarily treated as a primary input fed through a unit 
delay to represent the initial node charge. Working from the maximum strength 
level downward, two systems of equations are set up and solved symbolically at 
each strength level. The gate outputs representing the final steady state solution 
are then connected to the primary inputs that were temporarily introduced for 
the storage nodes. 

For strength level s, the first system of equations, termed the “clear” equa- 
tions, expresses the conditions under which each node is not the destination of 
a definite path of strength s. These equations are expressed in terms of And/Or 
logic. The second system, termed the “state” equations, expresses the combined 
effect of all unblocked paths of strength greater than or equal to s to each node. 
These equations are expressed in terms of Steering logic, with ENABLE express- 
ing conditional signal propagation and MERGE expressing the effect of multiple 
signals to a single destination. 

3.3 Symbolic manipulation 

During the setting up and the solving of equations, each signal corresponds 
to either a primary input or the output of a logic gate. To “compute” the result 
of applying some operation (AND, MERGE, etc.) to a set of argument signals, the 
symbolic manipulator either finds an existing gate with the appropriate function- 
ality, or it adds a new gate to the network having the argument signals as inputs. 
The symbolic manipulator also applies extensive optimizations as it proceeds. 
Most of the optimizations are similar to those found in ANAMOS, as well as in 
many optimizing compilers, e.g., constant evaluation, common subexpression 
detection, except that it performs these optimizations in the 4-valued signal 
algebra. Although most of these optimizations require only constant time, 
others involve attempting a simple form of proof by contradiction to show that a 
candidate gate would always produce an output value equal to one of its inputs. 
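
The "find an existing gate or add a new one" behavior resembles hash-consing. The following Python sketch illustrates that idea together with two of the constant-time optimizations named above (constant evaluation and common subexpression detection). It is an assumed, simplified structure: the class `Network`, its methods, and the signal numbering are hypothetical, and the proof-by-contradiction optimization is not modeled.

```python
class Network:
    """Gate network with structural sharing (hash-consing)."""

    def __init__(self):
        self.gates = {}    # (op, args) -> signal id
        self.signals = []  # signal id -> description tuple

    def _new(self, desc):
        self.signals.append(desc)
        return len(self.signals) - 1

    def input(self, name):
        return self._new(("INPUT", name))

    def const(self, value):
        key = ("CONST", value)
        if key not in self.gates:
            self.gates[key] = self._new(key)
        return self.gates[key]

    def apply(self, op, *args):
        """Return a signal computing op(args), reusing gates when possible."""
        if op == "AND":
            zero, one = self.const("0"), self.const("1")
            if zero in args:
                return zero                    # constant evaluation
            args = tuple(sorted(set(args) - {one}))
            if not args:
                return one
            if len(args) == 1:
                return args[0]                 # AND(a, a) = AND(a, 1) = a
        elif op == "INVERT":
            (a,) = args
            desc = self.signals[a]
            if desc[0] == "INVERT":
                return desc[1][0]              # double inversion cancels
        key = (op, args)
        if key not in self.gates:              # common subexpression check
            self.gates[key] = self._new(key)
        return self.gates[key]


net = Network()
a, b = net.input("a"), net.input("b")
g1 = net.apply("AND", a, b)
g2 = net.apply("AND", b, a)  # same gate as g1: arguments are canonicalized
```

Because arguments of AND are sorted before lookup, `g1` and `g2` refer to the same gate, so no duplicate logic is created.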

Unlike ANAMOS, where the effects of X signal values are modeled very conservatively, 
TRANALYZE can generate models with “Boolean” optimization, 
ignoring the effects of X values in And/Or logic. For most applications, de- 
signers are willing to give up detailed modeling of X values in favor of simpler 
gate-level models. Indeed, they would find the gate model generated by Boolean 
optimization less prone to false X value propagation. 

Additional transformations are implemented by the symbolic manipulator to 
convert Steering logic into And/Or logic. The first two, illustrated in Figure 
2, effectively perform series-parallel reduction of the transistor network. The 
first takes a chain of ENABLE gates, such as arises from the analysis of a series 
transistor chain, and collapses it into a single ENABLE controlled by an AND. 
The second takes the MERGE of a set of ENABLE gates having a common data 
input, such as arises from the analysis of a set of parallel paths, and collapses 
it into a single ENABLE gate controlled by the OR of the parallel control signals. 
This OR is implemented as a combination of INVERT and AND to exploit the 
equivalences implied by DeMorgan’s Laws. The third rule of Figure 2 illustrates 
the final transformation required to extract static logic gates. The reduction of 
the pullup and pulldown networks by series-parallel transformations yields a 
pair of complementary signals. These signals control ENABLE gates connected to 
constants 1 and 0, representing power and ground. This final configuration of 
ENABLE and MERGE gates can then be eliminated. 

Figure 3. Performance of Previous Switch-Level Analysis Tools. Figure of merit is defined 
as (Gate Count)/(Transistor Count). 
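
The three transformations described above can be checked by exhaustive enumeration over the signal values. The sketch below does this in Python under stated assumptions: ENABLE treats a control value of X like 1 (as described in Section 2.1), MERGE is assumed to be the wired resolution of tri-state logic, and AND, OR, and INVERT are assumed to follow standard three-valued logic. The reading of the third rule (the MERGE of complementary ENABLE gates to constants 1 and 0 reduces to the pullup control signal) is an interpretation of the text, since Figure 2 is not reproduced here.

```python
from itertools import product

DATA = "01XZ"     # values a data signal may take
CONTROL = "01X"   # control signals come from And/Or logic, so never Z


def invert(a):
    return {"0": "1", "1": "0", "X": "X"}[a]


def and_(a, b):
    return "0" if "0" in (a, b) else "1" if a == b == "1" else "X"


def or_(a, b):
    return "1" if "1" in (a, b) else "0" if a == b == "0" else "X"


def enable(a, e):
    return "Z" if e == "0" else a   # X on the control acts like 1


def merge(a, b):
    if a == "Z":
        return b
    if b == "Z":
        return a
    return a if a == b else "X"


# Rule 1: a chain of ENABLE gates equals one ENABLE controlled by an AND.
rule1 = all(
    enable(enable(a, e1), e2) == enable(a, and_(e1, e2))
    for a, e1, e2 in product(DATA, CONTROL, CONTROL)
)

# Rule 2: the MERGE of ENABLE gates sharing a data input equals one ENABLE
# controlled by the OR of the control signals.
rule2 = all(
    merge(enable(a, e1), enable(a, e2)) == enable(a, or_(e1, e2))
    for a, e1, e2 in product(DATA, CONTROL, CONTROL)
)

# Rule 3: complementary ENABLE gates to constants 1 and 0 reduce to the
# pullup control signal p itself.
rule3 = all(
    merge(enable("1", p), enable("0", invert(p))) == p
    for p in CONTROL
)
```

All three checks hold under these definitions, which is consistent with the claim that the ENABLE and MERGE configuration of a static gate can be eliminated.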

3.4 Network pruning 

During the analysis, TRANALYZE generates gate-level logic describing the 
functionality of every storage node in the circuit, including such nodes as the 
intermediate points in series transistor chains. TRANALYZE prunes the network 
by a form of mark-sweep garbage collection. Starting with the gates representing 
the circuit nodes that the user wishes to observe during simulation, 
the program traces backward, marking all reachable gates. Any unmarked gate 
can then be eliminated; its output value can have no bearing on the simulation 
results. In this manner, the program prunes a large fraction of the nodes from 
the transistor circuit, eliminating the need for heuristic methods to identify these 
nodes initially [1]. As an example, connection node E in the pullup chain of the 
NOR gate of Figure 1 is successfully eliminated by the pruning. 

Figure 4. Performance of TRANALYZE. Figure of merit is defined as (Gate 
Count)/(Transistor Count). 
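
The mark-sweep pruning can be sketched as a backward reachability pass. The Python below is a minimal illustration under assumed data structures: the dictionary representation of the gate network, the function name `prune`, and the example signal names (including `T_prev` for the delayed copy of T) are hypothetical.

```python
def prune(gates, observed):
    """Keep only gates that can influence an observed signal.

    gates: dict mapping each gate output name to the list of its input
           signal names (primary inputs do not appear as keys).
    observed: names of the signals the user wants to watch.
    """
    marked = set()
    stack = [s for s in observed if s in gates]
    while stack:                       # mark phase: trace backward
        g = stack.pop()
        if g in marked:
            continue
        marked.add(g)
        stack.extend(i for i in gates[g] if i in gates and i not in marked)
    # sweep phase: drop every unmarked gate
    return {g: ins for g, ins in gates.items() if g in marked}


# Hypothetical network: E is an internal pullup node that nothing observes.
gates = {
    "E": ["A"],
    "S": ["A", "B"],
    "T": ["S", "C", "T_prev"],
    "T_prev": ["T"],        # unit-delayed copy of T (stored charge)
}
kept = prune(gates, ["T"])
```

With T as the only observed node, the gates for S, T, and `T_prev` are retained, and the gate describing node E is swept away.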

3.5 Technology mapping 

As the final stage, the primitive gates are merged into a smaller number of 
more complex gate types. The initial target simulator for TRANALYZE is 
the hardware simulation machine SP [8]. This machine can model arbitrary 
4-valued gates with up to four inputs. Our technology mapper merges trees of 
gates, using a simple tree matching algorithm [10]. 

4. Experimental results 

Figures 3 and 4 show the results of the different analysis methods for sev- 
eral switch-level benchmarks. Results are given for 4 different representations: 
the Boolean DAG model generated by ANAMOS, the ternary gate model gen- 
erated by HLGCC, and the 4-valued models generated by TRANALYZE for 
two extremes of network optimization. Unit/ternary indicates that all transistors 
have unit delay, and X values are modeled conservatively. The resulting model 
is functionally equivalent to that produced by ANAMOS. Zero/binary indicates 
that transistors are assigned unit delay only to break feedback loops, and that 
Boolean optimization is applied. 

As a figure of merit, ratios are given of the number of gates (or DAG nodes) to 
the number of transistors in the circuit. Low (less than 1.0) values indicate that 
the program is able to abstract the circuit behavior. High numbers indicate cases 
where a complex gate-level model is required to capture the subtleties of switch-level 
behavior. As these figures indicate, TRANALYZE consistently outperforms 
both ANAMOS and HLGCC, even when forced to generate models with exactly 
the same functionality. 

The 74181 ALU is a direct mapping of a gate-level ALU into static CMOS 
gates. Both HLGCC and TRANALYZE successfully recognize all of the gates, 
except for the XORs. The Shift64 circuit is a 64-bit transmission gate shift reg- 
ister that can either shift or hold its data on each clock cycle. TRANALYZE 
can reduce each stage to just 3 four-input gates — comparable to a hand gener- 
ated model. The DRAM circuit is an nMOS 3-transistor dynamic RAM. This 
represents a difficult case for gate-level model generation, since any gate-level 
implementation of a RAM is far more complex than what can be implemented 
using custom transistor logic. The SLAP circuit is a 16-bit CMOS processor 
designed at CMU. It contains many difficult structures for gate-level model ex- 
traction, including a register file implemented as a static RAM, a Manchester 
carry chain ALU, and a transmission gate shifter network. Even for these more 
difficult circuits, TRANALYZE is able to generate reasonably concise models. 

Thus far, we have successfully simulated all of the benchmark circuits except 
for SLAP on SP. For the largest circuit simulated (DRAM256), SP operates 80 
times faster than COSMOS executing on a SUN-4/110. Even greater speedups 
can be expected for larger circuits. 

Abstract 

Optimization of a circuit by transistor sizing is often a slow, tedious and iterative manual process 
which relies on designer intuition. Circuit simulation is carried out in the inner loop of this tuning 
procedure. Automating the transistor sizing process is an important step towards being able to 
rapidly design high-performance, custom circuits. JiffyTune is a new circuit optimization tool 
that automates the tuning task. Delay, rise/fall time, area and power targets are accommodated. 
Each (weighted) target can be either a constraint or an objective function. Minimax optimization 
is supported. Transistors can be ratioed and similar structures grouped to ensure regular layouts. 
Bounds on transistor widths are supported. 

JiffyTune uses LANCELOT, a large-scale nonlinear optimization package with an augmented 
Lagrangian formulation. Simple bounds are handled explicitly and trust region methods are ap- 
plied to minimize a composite objective function. In the inner loop of the optimization, the fast 
circuit simulator SPECS is used to evaluate the circuit. SPECS is unique in its ability to ef- 
ficiently provide time-domain sensitivities, thereby enabling gradient-based optimization. Both 
the adjoint and direct methods of sensitivity computation have been implemented in SPECS. 

To assist the user, interfaces in the Cadence and SLED design systems have been constructed. 
These interfaces automate the specification of the optimization task, the running of the optimizer 
and the back-annotation of the results on to the circuit schematic. 

JiffyTune has been used to tune over 100 circuits for a custom, high-performance micropro- 
cessor that makes use of dynamic logic circuits. Circuits with over 250 tunable transistors have 
been successfully optimized. Automatic circuit tuning has been found to facilitate design re-use. 
The designers’ focus shifts from solving the optimization problem to specifying it correctly and 
completely. This paper describes the algorithms of JiffyTune and the environment in 
which it is used, and presents a case study of the application of JiffyTune to 
individual circuits of the microprocessor. 



1. Introduction, motivation and previous work 

Designers often spend a lot of time manually sizing their schematics for area, de- 
lay and power, particularly in the context of custom designs. The tuning process 
is iterative, slow, tedious and error-prone, with circuit simulation in the inner 
loop. The updating of transistor widths from one iteration to the next relies on 
human intuition. Automating the circuit optimization process is an important 
step towards rapidly designing high-performance, custom circuits. Automatic 
circuit tuning has the additional benefit of facilitating design adaptation and 
re-use. Hence an automatic tuning (and retuning) capability is crucial to the 
productive design of custom circuits. 

There have been many attempts to automate transistor sizing. 
The first class of methods [1, 2] is based on static timing analysis [3] in which 
the circuit is assumed to consist of pre-characterized library cells. The delay 
of each cell is available as an analytic function of the sizes of the transistors in 
the cell. Total path delay is expressed as a function of the individual transistor 
widths and optimized. In particular, if the Elmore delay model [4, 5] is used, this 
overall delay is seen to be a posynomial function (a particular algebraic form) 
of the transistor widths. By a simple mapping of variables, the objective is 
converted to a convex function [1], and hence any minimum of the latter is guar- 
anteed to be a global minimum. The advantages of static-timing-based methods 
include efficiency, ability to handle large designs and freedom from requiring 
input patterns to carry out the tuning. One of the problems with these methods 
is that they are not applicable to full-custom circuit designs, since static timing 
analyzers usually rely on pre-characterized library cells. Second, the accuracy 
of static timing analysis is limited (to about ±25% in our experience) making 
it unsuitable as a basis for tuning high-performance custom circuits. Finally, 
static timing analysis is prone to the false-path problem, so the optimizer may 
be working hard to tune paths that are either irrelevant or can never be sensitized. 
Recently, power optimization has been proposed in this general framework [6]. 
Power is measured by probabilistic methods [7] and then approximated by a 
posynomial function. Simultaneous tuning of drivers and interconnect has been 
proposed in [8, 9]. 
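
The "simple mapping of variables" mentioned above is commonly the exponential substitution used in geometric programming. The following is a sketch under that assumption. The symbols $c_k$, $a_{ik}$, $x_i$ and $y_i$ are generic and are not taken from the cited work.

```latex
% A posynomial in the transistor widths x_i > 0, with coefficients c_k > 0
% and arbitrary real exponents a_{ik}:
f(x) = \sum_{k} c_k \prod_{i} x_i^{a_{ik}}

% Substituting x_i = e^{y_i} turns each monomial into the exponential of
% an affine function of y:
f(e^{y}) = \sum_{k} c_k \exp\Big( \sum_{i} a_{ik}\, y_i \Big)

% Each term is convex in y (an exponential of an affine function scaled by
% c_k > 0), and a sum of convex functions is convex.
```

Since the substitution is a one-to-one mapping between $x > 0$ and $y$, any local minimum of the transformed problem corresponds to a global minimum of the original delay expression.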

Tuning based on dynamic simulation overcomes many of the above limita- 
tions of static tuning. The accuracy is as good as the simulator employed, false 
paths are not a problem and the method is applicable to any custom circuitry 
that the simulator can analyze. Appropriate input patterns must be provided by 
the user. These methods [10, 11] typically run SPICE in the inner loop to opti- 
mize such circuit performance functions as gain, area, delay and phase margin. 
However, using SPICE iteratively is computationally expensive and limits the 
size of the circuit that can be tuned. From an overall design perspective, we see 
static and dynamic methods complementing each other at different stages of the 
methodology, depending on the type of design. 

In this paper, we present a method for tuning custom MOS circuits that uses 
dynamic simulation and gradient-based optimization. Our ability to compute 
gradients efficiently is crucial to the success of this approach. JiffyTune is a 
prototype implementation of our method. An overview of JiffyTune is pre- 
sented in Section 2. JiffyTune uses SPECS [12, 13], a fast circuit simulator, 
to evaluate the circuit and provide function and gradient values. SPECS and the 
computation of sensitivities are the topics of Section 3. The optimization engine 
used in JiffyTune is LANCELOT [14, 15, 16], a large-scale nonlinear programming 
package that handles simple bounds explicitly and accommodates general 
constraints with an augmented Lagrangian formulation. LANCELOT has been 
customized to the circuit tuning problem. The numerical methods involved in 
the nonlinear optimization are described in Section 4. To make the tuning 
environment productive and intuitive, interfaces have been built in two different 
design environments. Section 5 is devoted to the concepts guiding these interfaces 
and their benefits. JiffyTune has been used on many custom, dynamic-logic 
circuits of a high-performance microprocessor. A case study of the application of 
JiffyTune to this chip design, along with benchmarks, is presented in Section 6, 
followed by a section containing conclusions and future work. 

Figure 1. High-level view of JiffyTune. 

2. Overview of JiffyTune 

This section provides an overview of the various high-level software compo- 
nents of JiffyTune, as depicted in Figure 1. Subsequent sections contain detailed 
descriptions of the individual components. 

The JiffyTune “engine” solves the following problem. Given a circuit schematic, 
input signals, a list of tunable transistors with initial widths and a set of 
circuit performance requirements, determine the optimal assignment of transistor 
widths to tunable transistors in order to achieve the requirements. The user 
interface makes it convenient for the user to specify the problem, and visualize 
and accept the results of optimization. 

2.1 JiffyTune 

The JiffyTune block in Figure 1 performs the administrative portion of the tun- 
ing task. A control file grammar has been defined for the specification of circuit 
optimization problems. The control file contains the following information. 

Parameters: This section contains a list of tunable transistors, their initial 
widths and bounds. Tunable transistors can be ratioed to other tunable transis- 
tors. Further, the user interface allows grouping of instances of similar struc- 
tures so that they track each other during tuning. Thus, for example, the cells 
of an n-bit wide multiplexer can be grouped to ensure that the cells stay identi- 
cal through the tuning process and thus lend themselves to a structured, regular 
layout. 

Measurements and functions: A measurement is either a crossing-time, power 
or area measurement. In the absence of layout information, area is modeled by 
the sum of the tunable transistors’ widths. Functions consist of any linear combi- 
nation of measurements. Thus delays and rise/fall times are typically the differ- 
ence of two crossing times. Each function has a weight, a target and a relation. 
A relation of “less than” implies that this function should be less than the target 
value. Similarly, relations of “greater than” and “equal to” are allowed. 
Alternatively, a function can be “minimized,” which means that the optimizer will try 
to decrease the value of this function as much as possible. Weights can be used 
to explore various trade-offs in tuning the circuit; they are especially required 
when functions of different quantities (area, delay, power) are being combined 
into a composite objective function. 

Any number of functions can be grouped into a minimax function. A mini- 
max function implies that the largest of some number of functions needs to be 
minimized or must meet a constraint. For example, the statement of the problem 
might be to minimize the delay of the worst of three paths through some 
combinational logic block. 
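
A minimax objective is not differentiable where two of the grouped functions tie for the maximum. One standard way to present it to a gradient-based optimizer is to introduce an auxiliary variable. The formulation below is a sketch of that textbook reformulation; the symbols $z$, $f_i$, $x$ and $T$ are generic, and the section above does not state which formulation JiffyTune itself uses.

```latex
% Minimax as an objective, with x the vector of transistor widths and
% f_1, ..., f_m the grouped functions (for example, three path delays):
\min_{x} \; \max_{1 \le i \le m} f_i(x)
\quad \Longleftrightarrow \quad
\min_{x,\, z} \; z
\quad \text{subject to} \quad f_i(x) \le z, \quad i = 1, \dots, m

% Minimax as a constraint with target T reduces to m ordinary constraints:
\max_{1 \le i \le m} f_i(x) \le T
\quad \Longleftrightarrow \quad
f_i(x) \le T, \quad i = 1, \dots, m
```

In the three-path example, $z$ plays the role of the worst-case delay, and each path delay becomes a smooth inequality constraint.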

Controls: This section provides administrative information like the maximum 
number of iterations, the layout grid for rounding transistor widths at the end of 
optimization and the location of the device model files. 

JiffyTune reads the control file and internally represents the problem in a for- 
mat that is understood by LANCELOT. JiffyTune also provides to LANCELOT 
a callable routine that will accept a set of transistor widths, perform a SPECS 
simulation, and return function and gradient values in the form required by 
LANCELOT. Then JiffyTune begins a LANCELOT optimization. At each iteration, 
JiffyTune keeps track of the best results so far. One of the main functions 
of the JiffyTune block is to chain-rule and combine gradients to provide 
to LANCELOT the gradients of various functions with respect to independent 
variables only. Typically, 25 to 30 iterations are required for convergence. The 
default maximum number of iterations in JiffyTune is 50. The recently imple- 
mented slack updating method described in Section 4.2 has led to fewer itera- 
tions being required in general. 
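
The chain-ruling of gradients onto independent variables can be illustrated for ratioed transistors. The Python sketch below assumes a simple linear ratio, with each dependent width equal to a constant times one independent width and no chains of dependent transistors. The function name, argument layout, and transistor names are hypothetical and do not describe the JiffyTune data structures.

```python
def gradient_wrt_independent(grad_all, ratios):
    """Fold gradients of dependent widths into their independent widths.

    grad_all: dict transistor name -> df/dW from the simulator, treating
              every width as if it were free.
    ratios:   dict dependent name -> (independent name, ratio), meaning
              W_dependent = ratio * W_independent.
    """
    grad = {n: g for n, g in grad_all.items() if n not in ratios}
    for dep, (indep, ratio) in ratios.items():
        # Chain rule: df/dW_indep += df/dW_dep * dW_dep/dW_indep
        grad[indep] = grad.get(indep, 0.0) + grad_all[dep] * ratio
    return grad


# Hypothetical example: P1 is ratioed to N1 with W_P1 = 2 * W_N1.
grad = gradient_wrt_independent(
    {"N1": 2.0, "P1": -0.5, "N2": 1.0},
    {"P1": ("N1", 2.0)},
)
```

In this example the optimizer sees only N1 and N2 as variables, and the sensitivity to N1 includes the contribution of the ratioed transistor P1.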

2.2 SPECS 

SPECS is a fast circuit simulator that uses simplified device models and event- 
driven techniques. JiffyTune calls SPECS in the inner loop to evaluate the cir- 
cuit, and provide function and gradient values. SPECS and its sensitivity com- 
putation capabilities are described in Section 3. 

2.3 LANCELOT 

LANCELOT is a large-scale nonlinear optimization package that handles sim- 
ple bounds and general constraints. JiffyTune provides the problem description 
and initial transistor sizes to LANCELOT. LANCELOT repeatedly calls SPECS 
with different transistor size settings, and builds a model of the “performance 
surface” of the circuit. It uses sophisticated nonlinear programming techniques 
to minimize the objective function. Details regarding LANCELOT and its 
application to the circuit tuning problem are provided in Section 4. 

In addition to LANCELOT, the Levenberg-Marquardt [17] and Minos [18] 
optimization packages have been integrated into JiffyTune. The optimization 
testing environment described in [19] was used to integrate Minos into Jiffy- 
Tune. The Levenberg-Marquardt method is limited since it only performs un- 
constrained optimization and is relatively unsophisticated. The Minos integra- 
tion has been used only for comparisons and “sanity checks.” 

2.4 The user interface 

JiffyTune requires a knowledgeable user to carefully specify the optimization 
problem, and greedily takes advantage of any unspecified aspects. Further, the 
engine requires a control file that is difficult to create manually, particularly for 
large circuits. The user interface helps the user concentrate on the specification 
of the optimization problem by providing an intuitive interface and eliminating 
the tedium of dealing with a file-driven tool. It also provides facilities for back- 
annotation of the results of tuning. Section 5 is devoted to a discussion of the 
environment in which JiffyTune is used and the description of the interface. 

3. SPECS and time-domain sensitivity computation 

SPECS (Simulation Program for Electronic Circuits and Systems) is a fast circuit 
simulation program. SPECS is on average two orders of magnitude faster 
than AS/X, an internal, traditional circuit analysis program [20] similar to SPICE. 
SPECS uses simplified device models and event-driven techniques to efficiently 
simulate MOSFET circuits in the time-domain, and has been used in production 
mode in various integrated circuit designs. JiffyTune uses SPECS to evaluate 
the circuit being optimized. However, this paper will not describe SPECS in 
any detail. The reader is referred to [12, 13, 21]. The device modeling assump- 
tions in SPECS restrict its relative timing accuracy to ±5%, and hence JiffyTune 
can only tune to within this accuracy limit. 

3.1 Sensitivity computation 

SPECS uses simplified device models that consist of piecewise constant char- 
acteristics in multiple dimensions and grounded, linear capacitances. These 
simplifications allow efficient, incremental time-domain sensitivity computation 
[22, 23, 24]. Both the adjoint [25, 21] and direct [26] methods have been 
implemented. In the direct method, branch constitutive relations (device char- 
acteristics) are directly differentiated with respect to the sensitivity parameter of 
interest. The circuit reflecting these differentiated equations, called the sen- 
sitivity circuit, has the same topology as the original circuit. Since SPECS 
uses piecewise constant device models, the sensitivity circuit consists of discon- 
nected capacitances for large sub-intervals of time, with occasional impulses of 
currents flowing between these capacitances at times corresponding to events in 
the nominal simulation. Thus the solution of the sensitivity circuit is extremely 
efficient. In the direct method, the sensitivities of all functions with respect to 
one parameter are computed with a single solution of the sensitivity circuit. 

In the adjoint method, elements are replaced by adjoint equivalents based on 
Tellegen’s theorem [25, 21]. Again, the circuit is very simple and lends itself 
to efficient solution. In this case, however, time is run backwards in the adjoint 
circuit, and the waveforms of the adjoint circuit are convolved with those of the 
original circuit to obtain the required sensitivities. The gradients of one function 
with respect to all parameters are computed in a single solution of the adjoint 
circuit. Hence, when there are sufficiently more parameters than functions to 
justify the overhead of convolution, the adjoint method is advantageous. 
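
The trade-off between the two methods can be illustrated on a much simpler problem than the time-domain one solved by SPECS. The sketch below is a static, linear analogy only: it assumes a system G x = b(p) with output functions f_j = c_j . x, and all names in it (solve, direct_sensitivities, adjoint_sensitivities, G, db_dp, C) are hypothetical. It shows that the direct approach needs one solve per parameter while the adjoint approach needs one solve per function, and that both produce the same gradients. The convolution step of the time-domain adjoint method has no counterpart here.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def direct_sensitivities(G, db_dp, C):
    """One solve per parameter; each solve gives d(all functions)/d(that parameter)."""
    jac = [[0.0] * len(db_dp) for _ in C]
    for k, rhs in enumerate(db_dp):
        dx = solve(G, rhs)                 # sensitivity of the state to parameter k
        for j, c in enumerate(C):
            jac[j][k] = dot(c, dx)
    return jac, len(db_dp)                 # Jacobian and number of solves


def adjoint_sensitivities(G, db_dp, C):
    """One solve per function; each solve gives d(that function)/d(all parameters)."""
    Gt = [list(col) for col in zip(*G)]    # transpose of G
    jac = [[0.0] * len(db_dp) for _ in C]
    for j, c in enumerate(C):
        a = solve(Gt, c)                   # adjoint state for function j
        for k, rhs in enumerate(db_dp):
            jac[j][k] = dot(a, rhs)
    return jac, len(C)


# Hypothetical example: 4 parameters, 1 function, 3 unknowns.
G = [[3.0, -1.0, 0.0], [-1.0, 4.0, -2.0], [0.0, -2.0, 5.0]]
db_dp = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
C = [[0.0, 0.0, 1.0]]

jac_direct, solves_direct = direct_sensitivities(G, db_dp, C)
jac_adjoint, solves_adjoint = adjoint_sensitivities(G, db_dp, C)
```

With many parameters and few functions, as in this toy case, the adjoint route needs fewer solves, which mirrors the argument in the text.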

Once the approximation in the simplified device models is accepted, the
computation of gradients is exact. After the sensitivity circuit is solved by either
method, gradients are chain-ruled and combined to obtain the sensitivity of each
function with respect to all the ramifications of varying the tunable transistors'
widths. When the width of a transistor varies, its source and drain diffusion
capacitances and all the intrinsic MOSFET parasitic capacitances change.
Each of these is submitted as an internal sensitivity parameter, and then all the
gradients are post-processed and combined appropriately. The flavor of these
computations is captured by the following simplified equation.

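$$
\frac{\partial f}{\partial W} =
\frac{\partial f}{\partial W_{eff}} \frac{\partial W_{eff}}{\partial W}
+ \frac{\partial f}{\partial C_{Dtotal}} \frac{\partial C_{Dtotal}}{\partial W}
+ \frac{\partial f}{\partial C_{Stotal}} \frac{\partial C_{Stotal}}{\partial W}
+ \frac{\partial f}{\partial C_{Gtotal}} \frac{\partial C_{Gtotal}}{\partial W},
$$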

where $f$ is the sensitivity function of interest, $W$ is the transistor width
(sensitivity parameter), $W_{eff}$ is the effective width, and $C_{Gtotal}$, $C_{Stotal}$ and
$C_{Dtotal}$ are the total parasitic capacitances at the gate, source and drain nodes,
respectively. This equation is further expanded in terms of the device model parameters.
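
As a small numerical illustration of this chain-rule combination, the sketch below sums the products of the internal sensitivities and the derivatives of the internal parameters with respect to width. The function name and all numerical values are hypothetical and are not taken from SPECS.

```python
def width_sensitivity(df_dinternal, dinternal_dW):
    """Chain rule: sum over internal parameters of (df/dq) * (dq/dW)."""
    return sum(df_dinternal[q] * dinternal_dW[q] for q in df_dinternal)


# Hypothetical values for one function f and one tunable transistor width W.
df_dinternal = {"Weff": -2.0, "CDtotal": 0.5, "CStotal": 0.25, "CGtotal": 1.0}
dinternal_dW = {"Weff": 1.0, "CDtotal": 0.4, "CStotal": 0.4, "CGtotal": 0.8}

df_dW = width_sensitivity(df_dinternal, dinternal_dW)
```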

Voltage crossing sensitivities are expressed in terms of the nominal voltage 
waveform and transient sensitivity waveform of the appropriate signal, both 
sampled at the time of the voltage crossing of interest. In the case of the ad- 
joint method, the transient sensitivity waveform is sampled by expressing the 
required value as a convolution integral and choosing to excite the adjoint cir- 
cuit by an appropriate current source connected to that node. 
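
The relation behind this statement follows from implicit differentiation. In the sketch below the notation is an assumption introduced for illustration and does not appear in the original: $v(t, p)$ is the node voltage waveform, $p$ a sensitivity parameter, $V_x$ the crossing threshold and $t_x(p)$ the crossing time.

```latex
v\bigl(t_x(p),\, p\bigr) = V_x
\;\Longrightarrow\;
\frac{\partial v}{\partial t}\,\frac{d t_x}{d p} + \frac{\partial v}{\partial p} = 0
\;\Longrightarrow\;
\frac{d t_x}{d p} =
-\left.\frac{\partial v / \partial p}{\partial v / \partial t}\right|_{t = t_x}
```

The numerator is the transient sensitivity waveform and the denominator is the slope of the nominal waveform, both sampled at the crossing time.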

3.2 Sensitivity benchmarks 

The number of time-domain gradients computed during a typical JiffyTune run 
may be in the millions! Hence gradient computation must be extremely effi- 
cient to make this process feasible. A dynamic logic “branch scan” circuit with 
144 MOSFETs was chosen to demonstrate the efficiency of gradient computa- 
tion. The circuit was simulated in SPECS for a simulation interval of 27 ns. 
The CPU time for simulation was 2.05 s on an IBM RISC System/6000 model
590. Then the same simulation run was carried out with 36 sensitivity functions 
(crossing times) and 104 MOS transistor widths as sensitivity parameters. Since 
there were 64 diffusion and other parasitic capacitances dependent on these 104 
transistor widths, the total number of sensitivity parameters was 168. The num- 
ber of gradients computed in this benchmark was 6,048, since SPECS finds the 
gradient of every sensitivity function with respect to each sensitivity parameter 
(our Jacobian matrix is dense). The number of sensitivities required was un- 
usually large in this example, which was chosen to showcase the efficiency of 
gradient computation. The run times of SPECS with both the adjoint and direct 
methods on this benchmark circuit are shown in Table 1. From the table, we
see that the total run time for a JiffyTune iteration would be 24.94 s (assuming 
that the direct method were used). For comparison, the AS/X [20] run time 
on this circuit (with no gradient computation, of course) was 40.11 s. Hence, 
even on this modest example, JiffyTune can almost complete two iterations with
gradient computation in the time it takes AS/X to simulate the nominal circuit
once.



Run time in CPU seconds                          Adjoint method   Direct method
--------------------------------------------------------------------------------
Total run time                                        32.38            24.94
Run time for sensitivity computation only             30.32            22.89
Run time per sensitivity circuit solution              0.84             0.14
Run time per sensitivity circuit solution
  as a fraction of simulation time (2.05 s)           40.78%            6.65%
Run time per gradient computation                   5.01e-3          3.78e-3
Run time per gradient evaluation as a
  fraction of simulation time                          0.24%            0.18%

Table 1. Sensitivity computation run time. 



As can be seen from the table, the overhead of computing one gradient is a 
fraction of a percent of the original simulation time, which works out to 5 ms or 
less of CPU time in this example! The overhead of one sensitivity circuit anal- 
ysis is about 7% for the direct method and about 40% for the adjoint method. 
Note that the number of runs in the adjoint method is equal to the number of 
functions, while it is equal to the number of sensitivity parameters in the di- 
rect method. The higher overhead in the adjoint method is accounted for by 
the convolution required between the waveforms of the original circuit and the 
sensitivity circuit. SPECS inspects the number of functions and the number of 
parameters and automatically makes a judgment, based on a simple heuristic, as 
to which method will be more efficient. The heuristic favors the adjoint method 
if the number of parameters exceeds the number of functions by a factor of 5. 
This heuristic appears to be effective in practice. 
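
The selection rule and the derived entries of Table 1 can be sketched as follows. The function names are hypothetical, and reading "exceeds by a factor of 5" as n_parameters > 5 * n_functions is an assumption; the actual test in SPECS is not given in the text. The timings are the measured values quoted in Table 1.

```python
def choose_sensitivity_method(n_functions, n_parameters, factor=5):
    """Assumed reading of the heuristic: adjoint only if parameters exceed functions by `factor`."""
    return "adjoint" if n_parameters > factor * n_functions else "direct"


def benchmark_summary(n_functions=36, n_parameters=168,
                      t_sens_direct=22.89, t_sens_adjoint=30.32):
    """Recompute the per-solve and per-gradient times from the totals in Table 1."""
    n_gradients = n_functions * n_parameters      # dense Jacobian
    return {
        "gradients": n_gradients,
        "direct_s_per_solve": t_sens_direct / n_parameters,    # one solve per parameter
        "adjoint_s_per_solve": t_sens_adjoint / n_functions,   # one solve per function
        "direct_s_per_gradient": t_sens_direct / n_gradients,
        "adjoint_s_per_gradient": t_sens_adjoint / n_gradients,
    }


summary = benchmark_summary()
method = choose_sensitivity_method(36, 168)
```

Under this reading, the benchmark (168 parameters, 36 functions) falls on the direct side of the rule, which agrees with the lower total run time of the direct method in Table 1.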

4. Nonlinear optimization in JiffyTune 
4.1 LANCELOT 

The optimization engine of JiffyTune is based upon the large-scale nonlinear
programming package LANCELOT. The kernel algorithm is an adaptation of
a trust region method to the general nonlinear optimization problem subject to
simple bounds. The method is extended to accommodate general constraints by
using an augmented Lagrangian formulation, and the bounds are handled directly
and explicitly via projections that are easy to compute.

In the context of unconstrained optimization, trust region methods, combining
an intuitive framework with a powerful and elegant theoretical foundation,
have led to robust numerical implementations. An excellent reference is [27].
The basic idea of trust region methods is to approximately minimize a model
of the objective function in a local neighborhood (called the trust region)
centered at the current point. The objective function is modeled about the current
point $x^k \in \mathbb{R}^n$, where $k$ is the iteration count and $x$ is the $n$-vector of
variables. To minimize the model in the trust region, a step $s^k$ is taken at
iteration $k$ to arrive at the point $x^k + s^k$. The function is evaluated at this point
to determine how well the model predicted the actual change in the objective
function. If good descent is obtained, the approximate minimizer is accepted as
the next iterate ($x^{k+1} \leftarrow x^k + s^k$) and the trust region is expanded. If moderate
descent is obtained, the trust region size remains unchanged, but the step is
accepted. Otherwise, no new point is accepted and the trust region is contracted.
The beauty of such an approach is that, when the trust region is small enough and
the problem smooth, the approximation is good, provided the gradients are
sufficiently accurate. Moreover, assuming one does at least as well as the minimum
along the steepest descent direction of the model within the trust region (which
determines the so-called Cauchy point), one can ensure convergence to a stationary
point. In addition, the trust region is eventually expanded so that it does not
interfere with the subsequent iterates, and thus, assuming that in this situation the
underlying algorithm is sufficiently sophisticated, one can ensure fast asymptotic
convergence.
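
The accept/expand/contract logic described above can be sketched in a few lines. This is a minimal illustration that takes only the Cauchy point as the step; it is not the LANCELOT algorithm. The acceptance thresholds (0.75 and 0.1), the radius update factors, the Euclidean trust region and the test function are all assumptions chosen for the example.

```python
import math


def trust_region_cauchy(f, grad, hess, x, radius=1.0, tol=1e-8, max_iter=200):
    """Basic trust region iteration using the Cauchy point of a quadratic model."""
    x = list(x)
    for _ in range(max_iter):
        g = grad(x)
        gnorm = math.sqrt(sum(gi * gi for gi in g))
        if gnorm <= tol:
            break
        H = hess(x)
        Hg = [sum(hij * gj for hij, gj in zip(row, g)) for row in H]
        gHg = sum(gi * hgi for gi, hgi in zip(g, Hg))
        # Cauchy point: minimizer of the model along -g inside the trust region.
        t_max = radius / gnorm
        t = min(t_max, gnorm ** 2 / gHg) if gHg > 0 else t_max
        trial = [xi - t * gi for xi, gi in zip(x, g)]
        predicted = t * gnorm ** 2 - 0.5 * t * t * gHg   # decrease promised by the model
        actual = f(x) - f(trial)                         # decrease actually obtained
        rho = actual / predicted
        if rho >= 0.75:        # good descent: accept the step and expand the region
            x, radius = trial, 2.0 * radius
        elif rho >= 0.1:       # moderate descent: accept, keep the region size
            x = trial
        else:                  # poor descent: reject the step and contract the region
            radius *= 0.5
    return x


# Hypothetical smooth test function with minimizer (1, -2).
def f(x):
    return (x[0] - 1.0) ** 4 + (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

def grad(x):
    return [4.0 * (x[0] - 1.0) ** 3 + 2.0 * (x[0] - 1.0), 2.0 * (x[1] + 2.0)]

def hess(x):
    return [[12.0 * (x[0] - 1.0) ** 2 + 2.0, 0.0], [0.0, 2.0]]


x_min = trust_region_cauchy(f, grad, hess, [3.0, 1.0])
```

A production method would improve on the Cauchy point within the trust region to obtain the fast asymptotic convergence mentioned in the text; the Cauchy step alone only guarantees convergence to a stationary point.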

The extension of the above ideas to problems with simple bounds is relatively 
straightforward and is illustrated in Figure 2, for a quadratic model function 
and an $\ell_\infty$ trust region. Essentially, one generalizes the Cauchy point (vk in the
figure) to the minimum along the projected gradient path (x* — w — w) within 
the trust region, where the projection is with respect to the bounds (either those
provided by the user or implicit in the trust region). As in the previous case,
global convergence can be guaranteed, provided one does at least as well as the 
generalized Cauchy point (w). If a variable, as determined by the generalized
Cauchy point, is at a bound, it is said to be an activity. Unbounded variables
are free. Activities are fixed temporarily, thus reducing the dimensionality of
the search space (from two to one in the figure). Then, using only the free 
variables, the model of the objective function is further minimized within the 
feasible region and within the trust region (w is optimal in Figure 2). Thus one
obtains better convergence, and ultimately, satisfactory asymptotic convergence. 
Updating of the trust region size and current point is handled in exactly the same
way as it is in the unconstrained case. 
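
The projected gradient path and the identification of activities can be sketched as below. This is a crude illustration under stated assumptions: the path is sampled at equally spaced values of the step length and the best sampled point is taken, whereas an actual implementation would locate the first local minimizer along the piecewise-linear path exactly. The function names, the sampling parameters and the example data are hypothetical.

```python
def clip(v, lo, hi):
    return min(max(v, lo), hi)


def generalized_cauchy_point(x, g, H, lower, upper, radius, t_max=2.0, samples=200):
    """Approximate minimizer of the quadratic model along the projected gradient path."""
    n = len(x)
    # Intersect the user bounds with the box-shaped (infinity-norm) trust region.
    lo = [max(lower[i], x[i] - radius) for i in range(n)]
    hi = [min(upper[i], x[i] + radius) for i in range(n)]

    def model(s):
        Hs = [sum(H[i][j] * s[j] for j in range(n)) for i in range(n)]
        return sum(g[i] * s[i] + 0.5 * s[i] * Hs[i] for i in range(n))

    best_point, best_value = list(x), 0.0
    for k in range(1, samples + 1):
        t = t_max * k / samples
        point = [clip(x[i] - t * g[i], lo[i], hi[i]) for i in range(n)]
        value = model([point[i] - x[i] for i in range(n)])
        if value < best_value:
            best_point, best_value = point, value
    # Variables sitting on a bound at this point are the activities.
    active = [i for i in range(n) if best_point[i] in (lo[i], hi[i])]
    return best_point, active


# Hypothetical two-variable example; the second variable runs into its upper bound.
gcp, activities = generalized_cauchy_point(
    x=[1.0, 1.0], g=[1.0, -2.0], H=[[2.0, 0.0], [0.0, 2.0]],
    lower=[0.0, 0.0], upper=[2.0, 1.2], radius=1.0)
```

In this example the second variable becomes an activity and would be held fixed while the model is minimized further over the first variable only, which reduces the search space from two dimensions to one as in the figure.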

It has been proved [14] that this method converges to a Kuhn-Tucker point 
[28]. Moreover, the correct active simple bounds are identified after a finite 
number of iterations assuming that strict complementarity is satisfied and the 
activities determined by the generalized Cauchy point are kept active during the 
rest of the iteration when the model is further reduced. Details are given in [14].

Figure 2. Illustration of generalized Cauchy point.



The extension to handle equality constraints is carried out by means of an 
augmented Lagrangian function 

$$
\Phi(x, \lambda, \mu) = f(x) + \sum_{i=1}^{m} \lambda_i c_i(x) + \frac{1}{2\mu} \sum_{i=1}^{m} c_i(x)^2 .
$$

$\Phi$ is minimized subject to the explicit bounds, using the earlier algorithm. Here
$f$ is the objective function, $x$ the variables of the optimization, $c_i(x)$ is an
equality constraint with $\lambda_i$ being the corresponding Lagrange multiplier and $\mu$ the
penalty parameter used to dynamically weight feasibility. Inequality constraints
are converted to equality constraints by first introducing slack or surplus
variables, if necessary, and then formulating the augmented Lagrangian as before.
This approach can be summarized as follows:

1 Test for convergence using the following two conditions: sufficient
stationarity (the projected gradient of the augmented Lagrangian with respect
to the simple bounds is sufficiently small) and sufficient feasibility
(the norm of the constraint violations is sufficiently small).

2 Use the simple bounds algorithm to find an approximate stationary point
(minimizer) of $\Phi$ subject to simple bounds.

3 If sufficiently feasible, update the multipliers $\lambda_i$ and decrease the
tolerances for stationarity and feasibility.

4 Otherwise, give more weight to feasibility (decrease $\mu$) and reset the
tolerances for stationarity and feasibility.

It is possible to show, under suitable conditions, that convergence to a first-order 
stationary point for the nonlinear programming problem is attained. Further, if 
there is a single limit point, eventually the penalty parameter $\mu$ is not reduced.
Details of these and other theoretical properties are given in [15] and [29]. 
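
The four steps above can be sketched on a tiny problem. This is an illustration under several assumptions and not LANCELOT itself: the inner solver is plain projected gradient descent standing in for the simple-bounds trust region algorithm, the multiplier update lambda <- lambda + c(x)/mu is the standard first-order rule for this form of augmented Lagrangian, and the tolerance schedule, the factor of 0.1 and the test problem are hypothetical.

```python
# Hypothetical problem: minimize (x0 - 2)^2 + (x1 - 2)^2
# subject to c(x) = x0 + x1 - 2 = 0 and 0 <= x <= 3. Solution: x = (1, 1), lambda = 2.
LO, HI = [0.0, 0.0], [3.0, 3.0]
C_GRAD = [1.0, 1.0]


def project(x):
    return [min(max(xi, l), h) for xi, l, h in zip(x, LO, HI)]


def f_grad(x):
    return [2.0 * (x[0] - 2.0), 2.0 * (x[1] - 2.0)]


def c(x):
    return x[0] + x[1] - 2.0


def phi_grad(x, lam, mu):
    """Gradient of f + lam*c + c^2/(2*mu)."""
    w = lam + c(x) / mu
    return [fg + w * cg for fg, cg in zip(f_grad(x), C_GRAD)]


def projected_gradient_norm(x, lam, mu):
    g = phi_grad(x, lam, mu)
    p = project([xi - gi for xi, gi in zip(x, g)])
    return max(abs(pi - xi) for pi, xi in zip(p, x))


def solve_bound_constrained(x, lam, mu, omega, max_iter=100000):
    """Step 2: approximately minimize the augmented Lagrangian subject to bounds."""
    step = 1.0 / (2.0 + 2.0 / mu)   # 1/L for this particular quadratic problem
    for _ in range(max_iter):
        if projected_gradient_norm(x, lam, mu) <= omega:
            break
        g = phi_grad(x, lam, mu)
        x = project([xi - step * gi for xi, gi in zip(x, g)])
    return x


def augmented_lagrangian(x, lam=0.0, mu=1.0, final_tol=1e-6, max_outer=50):
    omega, eta = 1e-2, 1e-1          # stationarity and feasibility tolerances
    for _ in range(max_outer):
        x = solve_bound_constrained(x, lam, mu, omega)
        violation = abs(c(x))
        # Step 1: convergence test on stationarity and feasibility.
        if projected_gradient_norm(x, lam, mu) <= final_tol and violation <= final_tol:
            break
        if violation <= eta:         # Step 3: update multiplier, tighten tolerances
            lam += c(x) / mu
            omega, eta = 0.1 * omega, 0.1 * eta
        else:                        # Step 4: weight feasibility more, reset tolerances
            mu *= 0.1
            omega, eta = 1e-2, 1e-1
    return x, lam, mu


x_opt, lam_opt, mu_final = augmented_lagrangian([0.0, 0.0])
```

In this run the penalty parameter is reduced only during the first outer iterations; once the iterates are sufficiently feasible, progress comes from the multiplier updates, consistent with the remark that the penalty parameter is eventually not reduced.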

A significant cost in the optimization is solving a linear system of equations. 
Typically these arise from the necessity to determine an approximate stationary 
point for a quadratic function - equivalently, the necessity to solve a linear sys- 
tem whose coefficient matrix is symmetric. If the system is large, there are two 
approaches. The first is to use direct methods based on multi-frontal techniques 
(see Chapter 10 of [30]). Our experience to date, however, has been that an it- 
erative approach using preconditioned conjugate gradients is more robust. All 
our reported numerical results with JiffyTune use this method. The appeal of 
conjugate-gradient methods for large-scale optimization is that they are particu- 
larly simple and only require that we store a few vectors. Moreover, they can be 
significantly accelerated by the use of preconditioners. Perhaps the best known 
conjugate basis for a convex quadratic form is the set of eigenvectors of the Hes- 
sian. The essential result is that at each iteration, the conjugate gradient method 
minimizes the quadratic model in the space spanned by the corresponding con- 
jugate basis. If we can cluster eigenvalues (i.e., approximately have multiple 
eigenvalues) we can reduce the number of iterations for good approximations 
to minimizers from close to n to close to the number of clusters. The perfect 
way to do this in the quadratic case is to precondition with the Hessian inverse 
- but then this is equivalent to carrying out Newton’s method. Surprisingly, one 
can often do very well by using crude approximations to the Hessian (diago- 
nal matrices, for instance). A good description of conjugate methods is given 
in [28], Sections 4.8.3 and 4.8.5. The LANCELOT package offers several pre- 
conditioners, and the Schnabel-Eskow preconditioner [31] is used in JiffyTune. 
A detailed reference on the LANCELOT package, including all the available 
options, is given in the book [16] that accompanies the original software. 
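
A minimal sketch of preconditioned conjugate gradients is given below to make the storage and clustering remarks concrete. It uses a diagonal (Jacobi) preconditioner, which is one of the crude approximations mentioned above and is swapped in here for simplicity; it is not the Schnabel-Eskow preconditioner used in JiffyTune. All function names and test matrices are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def pcg(A, b, apply_preconditioner, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A; only a few vectors are stored."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                       # residual b - A x for the starting point x = 0
    z = apply_preconditioner(r)
    p = list(z)
    rz = dot(r, z)
    limit = max_iter if max_iter is not None else 10 * n
    for k in range(limit):
        if math.sqrt(dot(r, r)) <= tol:
            return x, k
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = apply_preconditioner(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, limit


def jacobi(A):
    """Diagonal preconditioner: divide the residual by the diagonal of A."""
    diag = [A[i][i] for i in range(len(A))]
    return lambda r: [ri / di for ri, di in zip(r, diag)]


def identity(r):
    return list(r)


# General small system with the Jacobi preconditioner.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, iters = pcg(A, b, jacobi(A))

# Five unknowns but only two distinct eigenvalues (1 and 5), no preconditioning.
D = [[1.0 if i == j and i < 3 else 5.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
x_clustered, iters_clustered = pcg(D, [1.0] * 5, identity)
```

The second system illustrates the clustering argument: with two distinct eigenvalues, the iteration count is governed by the number of clusters rather than by the dimension.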

4.2 Application of LANCELOT to JiffyTune 

In the context of JiffyTune it was necessary to make certain modifications to 
LANCELOT to account for the fact that the function and gradient values from 
SPECS, although accurate to within small perturbations, are noisy. The intro- 
duced errors are small but significantly larger than machine precision. Because 
of the complexity of general nonlinear optimization, many initializations (such
as the choice of the trust region radius or quadratic model) are based upon
intelligent guesses, which cannot, of course, be ideal in all circumstances. In the
worst case, for functions without noise, unfortunate choices can result in
inefficiencies, but in the noisy case they can be insurmountable. As a trivial
example, if movement of less than 0.001 microns in transistor width is considered
negligible, it may be disastrous if an automatic choice of the initial trust region
size produces a radius that is much smaller. For similar reasons, we had to introduce
looser tolerances for feasibility, line search discontinuities and bound activities, 
which are based upon machine precision in the original software. Finally, in or- 
der to stop gracefully and predictably we needed to consider step sizes beneath 
which further progress is unlikely and relate stopping criteria to this step size in 
a robust and consistent manner. 

Two other enhancements deserve special mention. Slack/surplus variables 
corresponding to satisfied inequalities are updated at each iteration so that the 
corresponding equality is satisfied exactly, whenever such an update is consis- 
tent with the convergence theory; the result has been a reduction in the number 
of iterations to convergence. 

Minimax optimization is handled by the introduction of an additional lin- 
ear variable and reformulating the problem as a general nonlinear programming 
problem. For example, suppose one had the problem

$$
\min_{x} \; \max_{1 \le i \le m} f_i(x). \qquad (3)
$$

This problem can be reformulated as

$$
\min_{x,\,z} \; z \quad \text{subject to} \quad z - f_i(x) \ge 0, \quad 1 \le i \le m. \qquad (4)
$$
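
The equivalence of problems (3) and (4) can be checked by brute force on a small example. The two functions and the search grids below are hypothetical; the sketch only demonstrates that the smallest feasible value of the auxiliary variable z equals the minimax value, and says nothing about how LANCELOT solves the reformulated problem.

```python
# Hypothetical one-dimensional example with m = 2 functions.
fs = [lambda x: (x - 1.0) ** 2, lambda x: (x + 1.0) ** 2]

xs = [i / 20 for i in range(-40, 41)]     # grid for x in [-2, 2]
zs = [j / 20 for j in range(0, 181)]      # grid for z in [0, 9]

# Problem (3): minimize the worst of the functions directly.
x_direct = min(xs, key=lambda x: max(f(x) for f in fs))
value_direct = max(f(x_direct) for f in fs)

# Problem (4): minimize z over all (x, z) pairs that satisfy z - f_i(x) >= 0.
z_best, x_best = min(
    (z, x) for x in xs for z in zs if all(z - f(x) >= 0 for f in fs)
)
```

Both searches arrive at the same point, with z equal to the minimax value. The reformulated problem has a smooth (linear) objective and smooth constraints, whereas the max in (3) is not differentiable where two functions tie.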

5. JiffyTune interface and environment 

The JiffyTune engine as described above is driven by a textual control file that 
describes the optimization problem. Manual preparation and editing of such a 
file is tedious and error-prone. Also, the sophistication of LANCELOT and the 
choices of algorithms and tolerances thereof are not directly relevant to the end 
user. Thus, from the inception of the JiffyTune project, it was realized that a 
good human interface and an intuitive abstraction of its use and behavior would 
be crucial to acceptance of the tool by circuit designers. Interfaces were built to 
run the tool from the Cadence [32] and SLED [33] schematic design systems. 
The interface in the Cadence design environment was evolved simultaneously 
with the JiffyTune engine. Integrating the tool into such a framework capital- 
izes on the familiarity of the user with the schematic design environment, and 
lends a visual and interactive aspect to the tool. Many of the complexities are 
hidden from the designer, although care was taken to allow full access to all tool
functions, if the designer so requires. The basic functions of the interface are
listed below.

Specification of tuning parameters: Tunable transistors are specified simply 
by selecting transistors or gates on a hierarchical schematic. The tunable tran- 
sistors/gates are visually marked by a flag to indicate tunability. Facilities are 
provided to ratio transistors. Thus the two NFETs in a NAND gate can be forced 
to have the same width or adhere to a given tapering ratio. In addition, similar 
instances (transistors, gates or higher-level functional blocks) can be “grouped” 
together, to ensure that corresponding transistors in those blocks track during 
tuning. 

Specification of measurements and functions: Presently, the interface sup- 
ports delay, transition time (slew), area and power functions. For delay and 
transition times, net selection is done directly on the schematic. Power func- 
tions are specified by selecting the required voltage source, again directly on 
the schematic. In all cases, the user is prompted to provide a relation and tar- 
get value as described in Section 2.1. In the schematic environment, with no 
knowledge of layout, area targets are approximated by the sum of the widths of 
the tunable transistors. The appropriate linear combination of measurements is 
written to the control file in each case. Minimax functions can be defined over 
any set of existing measurements. 

Specification of controls: Administrative information such as the maximum 
number of iterations, file location of device models and layout grid for rounding 
transistor widths at the end of optimization can be specified in a form that is 
pre-filled with project-specific defaults. 

Execution of JiffyTune: After specifying parameters, functions and controls,
the designer can ask for all this information to be written to a control file, which 
can be inspected or edited if required. Then the designer can launch the Jiffy- 
Tune engine, whereupon the progress of the optimization is displayed. 

Back-annotation of the results: The results of a JiffyTune run are back- 
annotated onto the schematic as suggested transistor widths next to the tran- 
sistors (or as new parameters next to gates). The designer can then accept these 
new widths/parameters, selectively or as a whole. Further, a facility is provided 
to back-annotate final waveform characteristics, such as delay through a gate or 
rise time of a net, directly onto the schematics, relieving the designer of the need 
to browse through simulation data using a waveform viewer. 

Utilities: As a courtesy to the designer, the JiffyTune menu also includes fa- 
cilities replicated from other areas of the schematic design environment, such as 
schematic checking, netlisting and automatically adjusting the number of fingers 
on each transistor, to create a single integrated tuning environment. Portability 
to various different sites and projects has been achieved by carefully separating 
the main code of the user interface from configurable site- and project-specific 
code.

Circuit requirements must be specified with care, since the optimizer will
take advantage of any unspecified aspects. For example, area minimization will 
shrink to its minimum size a transistor that does not contribute materially to 
any measured transition. Thus, the tool enforces clear expression of circuit re- 
quirements that otherwise are often tacit. Since these circuit requirements and 
attributes logically belong with the circuit (they are indeed part of the intel- 
lectual effort of designing the circuit), the tuning parameters and functions are 
stored in the design database, either as instance properties (tunability, upper and 
lower bounds on transistor widths) or as schematic properties (grouping, func- 
tions). This practice also encourages the reuse of circuits; if a circuit has been 
adequately specified, it can easily be retuned. 

6. Case study of JiffyTune use 

JiffyTune was applied to tune custom circuits in the critical paths of a high- 
performance, dynamic-logic microprocessor. The circuits consisted of a mix of 
transistors and continuously parameterized gates. JiffyTune made it possible to 
refine the transistor sizes of circuits more quickly and thus rapidly respond to 
design changes late in the chip design cycle. Thus more flexibility was preserved 
in changing the specifications of circuits. The Cadence graphical user interface 
made it possible for designers to use the tool with little or no training. 

JiffyTune was used by 41 designers during about 1,200 interactive sessions to 
tune 168 unique circuits. Over 2,200 successful JiffyTune runs were carried out, 
showing that some circuits were re-tuned multiple times. The results of tuning 
on one particular benchmark circuit are presented below. 

Table 2 lists the results of running JiffyTune on a 12-way priority decode cir- 
cuit under four different conditions. The circuit contains 70 MOSFETs and the 
simulation was run for 35 ns. The tuning runs all had 64 tunable transistors, 
of which 16 were independent and 48 dependent. The 17 functions to be op- 
timized included the rising delay through four critical paths, the falling delay 
through those paths, the rise/fall times on each of the above 8 transitions and 
an area constraint. For confidentiality purposes, the delay requirement on the 
worst of the critical paths has been normalized to 500 time units in our report 
on this benchmark. The table lists the rising and falling delay of the four paths 
being tuned as predicted by AS/X on the final design (the worst of the 8 delays 
for each run is shown in bold), the total tunable transistor area of the circuit and 
the CPU time required to run JiffyTune on an IBM RISC System/6000 model
590. The first JiffyTune run (HOT) started from a circuit that had previously 
been manually tuned (Manual). The worst delay through the circuit improved 
by 7.5% and the area decreased by 5.0%. The second JiffyTune run (COLD) 
started from an untuned circuit in which initial transistor sizes were set to the 
same default value as they would be for a “new design.” Comparing the results
of (HOT) and (COLD) shows that the poor start point did not change the final
results, but the optimizer had to work harder. The next JiffyTune run (DELAY) 
was set up to cause JiffyTune to reach the timing goal of 500 time units at all 
cost. JiffyTune was configured as in run (HOT), only with a weight on the area 
constraint that was a tenth of the previous value. The table shows that the goal 
was reached but at a high cost in transistor area. In general, we have found that it 
is important to impose an area constraint. Without an area constraint, JiffyTune 
converges to one of many equally fast circuits depending on the start point, with 
some solutions more efficient in area than others. The final run used the same 
start point and weights as HOT, but formulated the problem as a minimax opti- 
mization. A solution with a slightly higher delay but lower area was obtained in 
this case. 





                          Manual    HOT   COLD  DELAY  MINIMAX
---------------------------------------------------------------
Path #1, falling delay       555    494    488    483      497
Path #1, rising delay        471    475    473    469      510
Path #2, falling delay       535    495    495    483      506
Path #2, rising delay        494    488    488    472      524
Path #3, falling delay       561    519    517    497      544
Path #3, rising delay        497    519    519    497      527
Path #4, falling delay       497    494    491    484      516
Path #4, rising delay        462    497    496    485      476
Area                         893    844    849   1148      800
# JiffyTune iterations         -      9     26     16       41
Run time (CPU s)               -    172    465    289      716



Table 2. JiffyTune results for 12-way priority decode circuit; all delays are normalized to a 
requirement of 500 time units. 



JiffyTune in its present form is not directly applicable to designs in which 
gates are chosen from a fixed library of cells with a finite set of discrete power 
levels. JiffyTune performs well on hierarchical schematics with leaf cells con- 
taining any mix of transistors and continuously parameterized gates. In prac- 
tice, JiffyTune handles circuits containing pass transistors well, in contrast to 
optimizers based on static timing analysis, since SPECS yields electrically true 
sensitivities taking into account details of the device model such as body effect. 
As new custom circuits are designed, JiffyTune will make it possible to speed up 
the design process, make more refined designs and provide better information 
about performance trade-offs. 

7. Conclusions and future work 

In this paper we described JiffyTune, a program that optimizes circuits by
adjusting transistor sizes. JiffyTune makes use of fast simulation and time-domain
gradient computation in the circuit simulator SPECS, and advanced nonlinear
numerical techniques in the optimization package LANCELOT. Delay, rise/fall
time, area and power optimization have been implemented. The optimization
system is flexible and allows ratioing of transistors and grouping of identical
instances. An intuitive interface, including back-annotation of optimization
results onto the schematic, has been developed.

The environment in which a circuit will be used and the required performance 
are estimated long before the chip is built. By the time the circuit is integrated 
onto the chip, it may no longer be optimally tuned, much to the frustration of the 
design engineer. Changes in loading, changes in the specifications, changes in 
parasitics after extraction, changes in technology device models and remapping 
to a new technology are common occurrences during the course of a project. In 
such situations, retuning at the push of a button without tedious re-specification 
is extremely useful. 

JiffyTune has been successfully used to tune a number of circuits on the criti- 
cal paths of a high-performance microprocessor chip which makes liberal use of 
dynamic logic. It has been particularly useful in tuning tricky pass-gate circuits 
and has been found to enhance design re-use. Further, since the optimization 
process has been made easy and automatic for the designer, a paradigm shift 
has been observed; the issue becomes how to correctly specify the optimization 
problem rather than solving the optimization problem itself. 

There are a number of avenues for future work. “Event-driven convolution” 
is expected to speed up the computation of gradients by the adjoint method in 
SPECS. Repeated solution runs of the sensitivity or adjoint circuit are indepen- 
dent and therefore amenable to parallel processing. Occasionally, we encounter 
“non-working circuits” in the course of the optimization, when a transition to be 
measured does not occur; recovery from such situations is an interesting prob- 
lem. Extension to semi-infinite constraints [10] would allow optimization of cir- 
cuits while taking into account environment variations such as temperature and 
power supply voltage. Reformulating the problem to take advantage of group 
partial separability in LANCELOT [16, 34] would speed up the optimization. 
If the optimization could be formulated as a mixed integer/continuous problem, 
transistor ordering could be part of the optimization procedure. In addition,
applications to IC manufacturability are being considered.

Abstract

Six papers were chosen to represent twenty years of research in physical simulation and anal- 
ysis, three papers addressing the problem of extracting and simulating interconnect effects and 
three papers describing techniques for simulating steady-state and noise behavior in RF circuits. 
In this commentary paper we will try to describe the contribution of each paper and place that 
contribution in some historical context. 



1. Introduction 

Research in computer-aided design (CAD) is, by its nature, application-driven. 
Algorithmic and methodological innovation has almost always been a response 
to the changing challenges faced by designers. It should therefore come as no 
surprise that many of the best papers presented at ICCAD over its twenty-year
history are associated with design issues that emerged over the same time frame. 
For example, most of the physical simulation and analysis papers nominated for 
this commemorative proceedings addressed either interconnect effects or radio- 
frequency (RF) design, two topics that have grown enormously in importance 
over the last twenty years. 

2. Inductance Extraction 

The first paper in our physical simulation and analysis commemorative col- 
lection is Efficient Techniques for Inductance Extraction of Complex 3-D Ge- 
ometries by Matt Kamon, Michael Tsuk, Christopher Smithhisler, and Jacob 
White. It was published at ICCAD in 1992 [25] and can be found on page 403. 
This was the first public description of the FastHenry 3-D inductance extractor 
and its algorithms. 

The FastHenry project started as an obvious follow-on to the FastCap 
project [42, 43]. FastCap combined the fast-multipole algorithm with iterative
Krylov-based linear solvers to provide an algorithm whose complexity increased
linearly with the size of the problem. FastHenry was expected to be a relatively 
small project that applied the multipole and iterative methods of FastCap to the 
PEEC-based inductance algorithms of Ruehli [61]. The project was led by Matt 
Kamon, a student at MIT, and his advisor, Prof. Jacob White. The initial chal- 
lenge was to determine which approach should be used to formulate the mag- 
netic equations. Ruehli had already demonstrated the use of nodal analysis in 
his PEEC methods, but Kamon and White instead decided on going with mesh 
analysis. It proved to be a much better choice for several reasons, but perhaps 
most important is that it was more compatible with iterative solvers. Next, they 
applied GMRES, a Krylov-based iterative method, which appeared easy to do 
in concept, but was actually quite difficult in practice. It is vastly more diffi- 
cult to construct preconditioners for inductance than for capacitance because of 
the long range interactions that are inherent to inductance. The initial precondi- 
tioner was based on the assumption that the couplings between the mesh currents 
within an individual conductor are much tighter than those that occur between 
conductors. It was this, along with the mesh current formulation, that was ini- 
tially implemented in FastHenry and documented at ICCAD in 1992 [25]. In 
subsequent work they moved on to adapting the multipole methods to the mag- 
netic problem while continuing to improve the preconditioner [26, 27, 28, 29]. 

The FastHenry program would eventually make a large contribution to the 
research community, particularly in the area of interconnect analysis. While not 
completely general, FastHenry was able to handle a wide variety of 3-D struc- 
tures. It was not difficult to write an interface to and had sufficient capacity to 
solve very large problems by the standards of field solvers of the day. For ex- 
ample, if you were willing to wait you could use it to extract the inductance of 
relatively complex IC package lead frames. FastHenry modeled most of the physical
effects that were needed, such as skin and proximity effects, and provided an
out-of-the-box solution for researchers interested in inductance. It worked, was 
reliable, was reasonably fast, was freely available, and it had little to nothing 
in the way of competition. As such, it quickly became a critical tool for those 
exploring the effect of inductance on delays in interconnect. It was used by re- 
searchers to explore and understand inductance in complex geometries, and was 
a golden reference for anyone developing models of interconnect that include 
inductance. 

The impact of FastHenry was different in character from its forerunner, Fast- 
Cap. FastCap was the first solver whose computational complexity did not in- 
crease super-linearly with the size of the problem, and so was the first truly 
high-capacity solver. However, capacitance is fairly easy to understand and it 
is easy to write solvers using a variety of methods that address the capacitance 
problem. As such, FastCap quickly had many competitors. The contribution of
FastCap lay mainly in the ideas it embodied and the fact that it started a wave
of innovation in solver technology. Inductance was much trickier, especially in 
3-D. It was exceptionally difficult to write a solver that was both reliable and 
efficient, and there were few choices of algorithms. And so while the ideas 
and algorithms were less ground-breaking, the use of the program itself resulted 
in a deeper understanding of inductance by the user community and a wave of 
innovation in interconnect analysis. A remarkably large number of the papers 
on inductance that followed have used FastHenry as the benchmark reference 
[31, 32, 20, 64, 2]. 

3. Including Interconnect in Timing Analysis 

Although inductive effects have been important in integrated circuit packaging
for some time, inductance has only recently become an on-chip concern for designers
of high-performance digital circuits. However, interconnect resistance has
been a design issue for more than two decades. Interconnect resistance was 
a problem in the early 1980’s primarily because the then available fabrication 
technology provided only one layer of metal, making it necessary to use less 
conductive polysilicon for interconnect [40]. Technology improvements soon 
provided multiple layers of metal, thereby eliminating the need to use polysili- 
con for interconnect, but these same improvements also allowed for much higher 
circuit densities. To achieve these high circuit densities, it was necessary to re- 
duce the cross-sectional area of metal interconnect, but since some signals still 
crossed the entire integrated circuit, maximum interconnect lengths did not scale 
like cross-section areas. The result was a resistance-preserving transition from 
fat interconnect in poor conducting polysilicon to skinny interconnect in good 
conducting aluminum to even skinnier interconnect in better conducting copper. 

During the 1980’s, one approach dominated for estimating the delay from 
the output of a driving logical gate, through the resistive interconnect, to the 
possibly many receiving gate inputs. The approach involved generating a model 
resistor-capacitor (RC) network, and then approximately analyzing the 
model [70]. In the model, the driving gate became a voltage source in series with 
a resistor, the interconnect became a collection of series resistors and grounded 
capacitors, and the receiving gates became grounded capacitors. The required 
delays were then estimated by approximating the step response of the model RC 
network, usually by exploiting the fact that the resistor network was almost al- 
ways a tree and then using a recursive algorithm to walk the tree and efficiently 
compute Elmore delays [60, 38, 17]. 
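
The recursive tree walk described above can be sketched as follows. This is a minimal illustration rather than the algorithm of any cited paper: the function name `elmore_delays`, the parent-array encoding of the tree, and the component values are all assumptions made for the example. One pass accumulates the capacitance downstream of each resistor, and a second pass sums resistance times downstream capacitance along the path from the driver.

```python
def elmore_delays(parent, resistance, capacitance):
    """Return the Elmore delay from the source to every node of an RC tree.

    parent[k] is the upstream node of node k (parent[0] = -1 for the root),
    resistance[k] is the resistor feeding node k from its parent (for the
    root, the driver's series resistance), and capacitance[k] is the
    grounded capacitor at node k.
    """
    n = len(parent)
    children = [[] for _ in range(n)]
    for node in range(1, n):
        children[parent[node]].append(node)

    # Preorder walk from the root: every node appears after its parent.
    order, stack = [], [0]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children[node])

    # Pass 1 (leaves to root): total capacitance at or below each node.
    downstream = list(capacitance)
    for node in reversed(order[1:]):
        downstream[parent[node]] += downstream[node]

    # Pass 2 (root to leaves): each resistor charges everything below it.
    delay = [0.0] * n
    for node in order:
        upstream = delay[parent[node]] if node else 0.0
        delay[node] = upstream + resistance[node] * downstream[node]
    return delay


# Hypothetical example: a 2-ohm driver into node 0, which fans out to
# nodes 1 and 2 through 1-ohm and 3-ohm wires.
delays = elmore_delays(parent=[-1, 0, 0],
                       resistance=[2.0, 1.0, 3.0],
                       capacitance=[1.0, 2.0, 4.0])
```

For this three-node tree the delays come out as 14, 16 and 26 (in ohm-farads), and the work grows linearly with the number of nodes.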

In present-day timing analysis programs, the techniques used to compute de- 
lays depend on the design methodology. In analyzing full-custom designs, de- 
lays are often estimated by numerically integrating the differential equations 
associated with the network formed from the driver and receiver transistors and
the interconnect resistors and capacitors. For designs based on cell libraries, the
numerical integration approach may not be the most efficient and can even be 
infeasible; the designer may not have access to transistor-level descriptions of 
the cells. Instead, delays in cell-based designs are analyzed using a strategy similar
to the twenty-year-old RC model strategy described above. The two main
differences between the current strategy and that of twenty years ago are associ- 
ated with determining the voltage source plus series resistor model for the driver 
gate, and with analyzing the resulting RC network. 

The most widely used approach for generating voltage source plus series re- 
sistor models for cell library outputs is described in [8]. This approach presumes 
that the parameters of the simplified model must depend on the true intercon- 
nect load impedance and not just on total capacitance. This concept was first 
presented in the second paper in our physical simulation and analysis commemorative
collection, Modeling the Driving-Point Characteristics of Resistive Interconnect
for Accurate Delay Estimation by Peter R. O’Brien and Thomas L.
Savarino, which can be found on page 393. The main contribution of O’Brien
and Savarino’s paper was to establish that the behavior of a driving gate is sub- 
stantially modified when much of the interconnect capacitance is “screened” 
from the gate output by the interconnect resistance. 

O’Brien and Savarino’s paper had an additional contribution that is more 
generic. In the years before their paper appeared, many researchers were us- 
ing moment-matching methods, but most of those papers were matching only 
the first order moments to estimate delays [60, 38, 24]. O’Brien and Savarino’s 
paper was one of the earliest to suggest that matching progressively higher or- 
der moments, which is equivalent to matching progressively higher terms in the 
Taylor series expansion of a transfer function, could be used to estimate input 
impedance. In addition, they suggested the idea that moment-matching could be 
used more generally to generate a reduced order model of the interconnect. In 
particular, they used moment-matching to determine values for the resistor and 
two grounded capacitors in a π circuit, and then suggested that the generated
circuit was a reduced model of the interconnect. As will be discussed in the next 
section, general model order reduction has now become a central part of strate- 
gies for handling interconnect, and O’Brien and Savarino’s early observation 
about the subject was years ahead of its time. 
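
As an illustration of the moment-matching idea, the sketch below fits a π circuit to the first three moments of a driving-point admittance, Y(s) = y1 s + y2 s^2 + y3 s^3 + ... . It assumes the load is a simple RC ladder, and the helper names (`ladder_moments`, `fit_pi`) and the truncated power-series recursion are choices made for this example, not a reproduction of the paper's implementation. The fit follows from expanding the admittance of a near capacitor, a series resistor, and a far capacitor, which gives y1 = c_near + c_far, y2 = -r c_far^2, and y3 = r^2 c_far^3.

```python
def add_shunt_cap(y, c):
    """Admittance moments after adding a grounded capacitor c at the node."""
    return (y[0] + c, y[1], y[2])


def add_series_res(y, r):
    """Moments seen through a series resistor r.

    Expands Y' = Y / (1 + r Y) as a power series in s, truncated at s^3.
    """
    y1, y2, y3 = y
    return (y1, y2 - r * y1 ** 2, y3 - 2.0 * r * y1 * y2 + r ** 2 * y1 ** 3)


def ladder_moments(segments):
    """Driving-point moments (y1, y2, y3) of an RC ladder.

    segments lists (r, c) pairs from the driver outward; each pair is a
    series resistor followed by a grounded capacitor.
    """
    y = (0.0, 0.0, 0.0)
    for r, c in reversed(segments):
        y = add_series_res(add_shunt_cap(y, c), r)
    return y


def fit_pi(y):
    """Match y1, y2, y3 with a pi circuit: c_near, then r, then c_far."""
    y1, y2, y3 = y
    c_far = y2 ** 2 / y3
    r = -(y3 ** 2) / y2 ** 3
    c_near = y1 - c_far
    return c_near, r, c_far


# Hypothetical two-segment ladder with unit resistors and capacitors.
moments = ladder_moments([(1.0, 1.0), (1.0, 1.0)])
c_near, r_pi, c_far = fit_pi(moments)
```

For the two-segment ladder the moments are (2, -5, 13), so the fitted π circuit has c_far = 25/13, r = 169/125 and c_near = 1/13. The total capacitance is preserved, but most of it sits behind the resistor, which is the screening effect described above.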

4. Model Order Reduction for Interconnect 

In order to examine signal propagation and coupling effects due to on- or off- 
chip interconnect, it is usually necessary to couple an electromagnetic analysis 
of the three-dimensional interconnect with a circuit-level analysis of the transis-
tors connected to that interconnect. Although there are techniques and commer-
cial tools that couple time-domain electromagnetic simulation programs directly
to circuit simulators [67], these approaches are too computationally expensive
to use in any but the simplest of scenarios. Instead, layout extraction tools are 
combined with model-order reduction techniques to generate low-order mod- 
els for the interconnect, and then these low-order models are included with the 
transistors in circuit- or timing-level simulation and analysis. 

When including interconnect effects in timing verification of an entire digital 
integrated circuit, the extraction-plus-reduction strategy must be fast enough to 
analyze millions of interconnect lines, but the accuracy requirements are mod- 
est. The commonly used approach is first to subdivide the interconnect into a 
large number of small sections, and then to apply a formula- or pattern-based 
algorithm to convert each of those small sections into resistors, capacitors and 
inductors. For complicated interconnect geometries, there can be a very large 
number of small sections, in which case the resulting extracted circuit will have 
a very large number of elements. Model reduction is used to generate low-order 
models of those large circuits, while still preserving the large circuit input-output 
behavior [37]. 

A very different extraction-plus-reduction strategy is used when examining 
interconnect coupling effects in packaging or for analog circuits. The objective 
of the coupling analysis is to determine if too much noise will be injected into 
a victim signal due to the proximity of the victim signal’s interconnect to the 
interconnect of simultaneously active aggressor signals. In order to capture the 
“ganging-up” effect of many simultaneously active aggressors, it is necessary to 
extract interconnect coupling terms that would be too small to be of concern in 
timing analysis; but there is no need to extract more than a few hundred intercon- 
nect lines. The modest speed and high accuracy needed to investigate packaging 
and analog circuit coupling problems has led to extraction-plus-reduction strate- 
gies in which model reduction is embedded in a three-dimensional field solver, 
and the combination directly produces reduced models [63, 5, 50]. 

Although the approach to extraction is very different in digital circuit tim- 
ing analysis versus packaging and analog circuit coupling examination, the ap- 
proach to model reduction is very similar. In both cases the interconnect is 
viewed as a linear multi-terminal device. The device terminals are usually
transistor-interconnect interface locations, but may also simply be convenient 
separation points. Regardless of application, the goal of model reduction is 
to construct a representation of the interconnect that is inexpensive to evaluate, 
yet still accurately represents terminal behavior. Finally, the form of the reduced 
model should be appropriate for circuit or timing simulation. The reduced model 
could be another circuit, as in [46, 44], or a state-space model, but not tables of 
frequency-domain data. 

The classic approach to model reduction is to select a fixed circuit topology,
like the π model in [46], and then use some kind of fitting procedure to
determine the element parameters. Such fitting techniques have uncertain accuracy,
and there is no way to increase that accuracy. In the early 1980’s,
researchers in system theory developed methods for reducing the order of state- 
space models that were, in a carefully chosen metric, provably optimal in pre- 
serving input-output behavior [22]. Although these optimal methods produced 
excellent reduced models, the computational cost of the numerical algorithms 
used in the reduction grew cubically with the number of states in the unreduced 
model, making the method much too slow to use on large interconnect problems. 

In [52], an algorithm was presented for generating low-order Padé approximates
[1] to circuit transfer functions. The algorithm was reasonably efficient
even for large circuits, and it was expected that increased accuracy could be
achieved by increasing the order of the Padé approximate. Also, since a qth-order
Padé approximate matches 2q - 1 moments of the original transfer function,
the method could be viewed as a generalization of the moment techniques
used in interconnect delay estimation. Early implementations of these Padé-based
methods were sometimes unreliable when applied to interconnect problems,
and occasionally generated inaccurate and/or unstable representations of
the interconnect. Since a qth-order Padé approximate matches 2q - 1 terms in
the zero-frequency centered Taylor series expansion of the original frequency
response, accuracy problems were mitigated by generalizing the approximate to
match Taylor expansions at multiple center frequencies [7].

Using multiple center frequencies improved the robustness of the methods 
in [52, 3], but the fundamental difficulty was not made clear until the seminal 
paper by Feldmann and Freund [18] published at ICCAD 1995 (though less well 
known, [21] appeared nearly simultaneously). In [18], it was shown that many 
of the accuracy issues associated with the Padé approximates stemmed from the
numerically unstable way in which those approximates were being computed. In
addition, it was shown that the Padé approximate for a transfer function could
be computed in a numerically stable manner by starting with a state-space de- 
scription, and then constructing bi-orthogonalized Krylov subspaces associated 
with the system matrix, its transpose, the input vector and the output vector. 

Once the connection was made between moment-matching and the Krylov 
subspaces, results using Krylov subspaces for model reduction appeared rapidly. 
Block full orthogonalization and bi-orthogonalization methods were developed 
to directly generate state-space representations of multiple-input multiple-output 
reduced models [63, 19], methods were generated with guaranteed stability 
properties [62], techniques were developed that allowed multiple expansion 
centers, and Krylov-subspace reduction was coupled to fast electromagnetic 
solvers [63, 50]. These algorithmic variants were then organized into a uni- 
fied theory of Krylov subspace reduction methods, the projection framework, 
which only appeared in a unique Ph.D. thesis that is required reading in the
field [23].

In the coupled circuit-interconnect simulation scenarios described above, it is
certain that many reduced-order models will be combined to describe a larger 
system, and therefore the reduced-order models must be passive. Stability is 
insufficient because the combination of stable systems is not necessarily stable, 
but the combination of passive systems is passive. The issue of passivity in 
reduced-order models first appeared in a widely circulated but still unpublished 
manuscript [4]. In that manuscript, an algorithm was given for generating guar- 
anteed passive reduced models of RC circuits, and that algorithm was connected 
to the notion of congruence transforms in [30]. The race was then on to find a 
guaranteed passive reduction strategy for RLC circuits, and the third paper in our 
physical simulation and analysis commemorative collection is the winner of that 
race, PRIMA: A Passive Reduced-order Interconnect Macromodeling Algorithm,
by Altan Odabasioglu, Mustafa Celik, and Lawrence Pileggi (see page 433). 

The PRIMA algorithm was more than just first; it had an elegant simplicity
that stemmed from combining two key steps. First, the reduction was applied
to a circuit equation formulation that generated two positive semi-definite matrices,
one operating on the state vector and one operating on the state vector’s
derivative. Then, each of the positive semi-definite matrices was reduced with
the same congruence transform. This two-step approach was influential because
it became a strategy for generating passivity-preserving methods with other desirable
properties, such as matching multiple moments [16] or having simple
update formulas [37]. The two-step approach was also used to adapt the PRIMA
algorithm to other applications, such as extraction from 3-D electromagnetic 
analysis [39]. 
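
The two-step idea can be illustrated with a small sketch. It is deliberately simplified and should not be read as the published algorithm: it assumes a single-port RC ladder (so both matrices are symmetric), uses dense pure-Python arithmetic, and builds the projection basis with a plain Gram-Schmidt Krylov loop in place of block Arnoldi. All function names and the example circuit are hypothetical. The same basis X is applied to both matrices as X^T G X and X^T C X, and the leading moments of the port impedance are compared before and after reduction.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def krylov_basis(g, c, b, order):
    """Orthonormal basis for span{r, A r, ..., A^(order-1) r},
    with r = G^-1 b and A = -G^-1 C (Gram-Schmidt orthogonalization)."""
    basis = []
    w = solve(g, b)
    for _ in range(order):
        for q in basis:
            h = dot(q, w)
            w = [wi - h * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        basis.append([wi / norm for wi in w])
        w = [-x for x in solve(g, matvec(c, basis[-1]))]
    return basis


def congruence(basis, m):
    """Return X^T M X, where the columns of X are the basis vectors."""
    return [[dot(qi, matvec(m, qj)) for qj in basis] for qi in basis]


def moments(g, c, b, count):
    """First `count` moments b^T A^k G^-1 b of the port impedance."""
    out, w = [], solve(g, b)
    for _ in range(count):
        out.append(dot(b, w))
        w = [-x for x in solve(g, matvec(c, w))]
    return out


# Hypothetical four-node RC ladder: unit conductances between neighbors,
# unit grounded capacitors, a unit conductance from node 0 to ground,
# and a current port at node 0.
G = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 1.0]]
C = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
B = [1.0, 0.0, 0.0, 0.0]

X = krylov_basis(G, C, B, order=2)
G_r, C_r = congruence(X, G), congruence(X, C)  # same transform for both
B_r = [dot(q, B) for q in X]
full = moments(G, C, B, 2)
reduced = moments(G_r, C_r, B_r, 2)
```

In this example the two-state reduced model reproduces the first two moments of the four-state ladder (1 and -4), and because the reduced matrices are congruence transforms of positive definite matrices they remain symmetric and positive definite.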

5. Harmonic Balance 

The fourth paper in our commemorative collection is Nonlinear Circuit Sim- 
ulation in the Frequency Domain by Kundert and Sangiovanni-Vincentelli [33] 
and can be found on page 383. This paper and the follow-on journal paper [34] 
were not the first papers on harmonic balance. However, they were the first to 
attempt to apply harmonic balance to large scale circuits, and as such directly 
led to the introduction of the commercial RF simulators that are so heavily used 
today. Previous attempts focused on microwave circuits that contained a very 
small number of transistors, one or at most two, and a large number of passive 
components. This made sense for discrete microwave circuits, but was not ap- 
propriate for the monolithic microwave integrated circuits (MMICs) of the day, 
nor would it be appropriate for the coming radio frequency integrated circuits 
(RFICs). Engineers had been successfully designing discrete microwave circuits 
for years by iterating prototypes, and there was not a strong need for nonlinear 
circuit simulators from this community. However, for MMIC designers, iterating
prototypes was both very expensive and time consuming, so they tended to
use Spice to verify their circuits before fabrication. 

There were several problems with using Spice for microwave circuits [35]. It 
is a time-domain simulator and so has difficulty including models of distributed 
components such as transmission line structures, particularly if they include loss 
or dispersion. It is a transient-based simulator, and so is inefficient when high 
frequency carrier signals are combined with low-frequency modulation signals. 
And, it is incapable of predicting the noise of circuits such as mixers and os- 
cillators. Nonlinear Circuit Simulation in the Frequency Domain did not really 
address any of these issues directly, but rather it showed how harmonic balance, 
which was known to be an answer to the first problem, could be extended so 
that it had the capacity to be applied to integrated circuits. It was the ability to 
simulate integrated circuits that gave harmonic balance its commercial appeal. 
It was left to follow-on work to address the remaining issues. 

The increase in capacity was achieved by reducing the computational cost 
of factoring the frequency-domain harmonic balance Jacobian by using a more 
easily factored near block diagonal matrix, one which ignored some of the cou- 
pling between frequencies. This represents the second generation harmonic bal- 
ance. Many years later there was another substantial increase in the capacity of 
harmonic balance when it was combined with fast Krylov subspace based meth- 
ods [41]. These methods, which are now pervasive, also require an approximate 
Jacobian to act as a preconditioner. The approach pioneered by Kundert and 
Sangiovanni-Vincentelli continues to live on as one of a few commonly used 
preconditioners for this third generation of harmonic balance simulators. 

Nonlinear Circuit Simulation in the Frequency Domain also reported on the 
development of Harmonica, a harmonic balance simulator that was to be very 
influential, spawning several of today’s leading simulators. The name was later
changed to Spectre when the Harmonica name was co-opted by Compact Soft- 
ware for the name of its independently developed harmonic balance simulator. 
Spectre was successful, at least in part, because it followed a formula pioneered 
by Spice. Spectre had a Spice-like netlist and use-model, it placed no artificial
limits on the number of nodes or components in the circuit, and most impor- 
tantly, its source code was made freely available to anyone willing to pay a 
nominal fee. The idea that Berkeley should simply give away the source code to 
its simulators was first championed by Don Pederson, and is credited with much 
of the broad success of Spice [49]. This allowed Berkeley Spectre to become 
the foundation of what are currently the two leading RF simulators, Agilent’s 
ADS and Cadence’s SpectreRF, and a leading integrated circuit simulator, Cadence’s
Spectre. It is interesting to note that even though Cadence’s simulators
are direct descendants of Berkeley’s Spectre, neither of them currently uses harmonic
balance. Rather, Spectre is a Spice-class simulator and SpectreRF uses
shooting algorithms that were developed as part of follow-on research to the
original harmonic balance work. 

In another interesting aside, this paper started a spirited competition between 
the CAD community and the microwave community [56, 55]. Each produced 
papers at a relatively rapid pace on variations of, and extensions to, harmonic 
balance in an attempt to lay first claim to the resulting innovations. Even com- 
peting simulators were produced. While both communities had their successes, 
in the end it was the CAD community’s deep knowledge of numerical algorithms 
and large-scale programming techniques that produced the biggest advances and 
ultimate success. In particular, it was the CAD community that substantially ad- 
vanced the capabilities of RF simulators by enhancing shooting methods [65], 
introducing Krylov methods, and developing efficient methods for time-varying 
noise analysis [66, 58, 14]. 

6. Noise Analysis 

In 1971, Rohrer et al [57] established the noise analysis method that eventu- 
ally made its way into SPICE [45] and became the de facto standard for three 
decades. This method was accurate, robust, and efficient, but was limited in the 
type of circuits it could handle. With their approach the circuit was first linearized
about the DC operating point. The noise analysis was then performed on the
resulting linear time-invariant (LTI) representation by computing the frequency-dependent
transfer functions from each noise source to the output of interest. Their
key insight, still important nearly thirty years later, is that for typical circuits 
there are many noise sources but noise is measured at only a few outputs. This 
many-input few-output case is much more efficiently analyzed using an adjoint 
formulation. 
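
The benefit of the adjoint formulation can be seen in a small sketch. It is an illustration under stated assumptions, not the code of [57] or of SPICE: the circuit is a hypothetical two-node RC network described by a nodal admittance matrix, only resistor thermal noise is modeled, and the function names are invented for the example. A single solve of the transposed system yields the transfer function from every noise source to the output, whereas the direct approach needs one solve per source.

```python
import math

BOLTZMANN = 1.380649e-23  # J/K


def solve(a, b):
    """Solve a x = b (real or complex) by Gaussian elimination with pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def noise_adjoint(y, sources, out):
    """Output noise PSD from one solve of the transposed system.

    sources lists (plus_node, minus_node, current_psd); None means ground.
    """
    unit = [0.0] * len(y)
    unit[out] = 1.0
    z = solve([list(col) for col in zip(*y)], unit)  # Y^T z = e_out

    def at(node):
        return 0.0 if node is None else z[node]

    return sum(psd * abs(at(p) - at(m)) ** 2 for p, m, psd in sources)


def noise_direct(y, sources, out):
    """Reference calculation: one solve of Y v = b_k per noise source."""
    total = 0.0
    for p, m, psd in sources:
        rhs = [0.0] * len(y)
        if p is not None:
            rhs[p] += 1.0
        if m is not None:
            rhs[m] -= 1.0
        total += psd * abs(solve(y, rhs)[out]) ** 2
    return total


def example(freq_hz, r1=1e3, r2=2e3, cap=1e-9, temp=300.0):
    """R1 from node 0 to ground, R2 from node 0 to node 1, C at node 1."""
    jwc = 2j * math.pi * freq_hz * cap
    y = [[1 / r1 + 1 / r2, -1 / r2],
         [-1 / r2, 1 / r2 + jwc]]
    kt4 = 4.0 * BOLTZMANN * temp
    sources = [(0, None, kt4 / r1), (0, 1, kt4 / r2)]  # thermal noise currents
    return noise_adjoint(y, sources, 1), noise_direct(y, sources, 1)
```

Both routines return the same output noise, and for this network the result agrees with the closed-form value 4kT R / (1 + (ωRC)^2) with R = R1 + R2. With two sources the saving is negligible, but the adjoint cost stays at one solve per frequency however many noise sources the circuit contains.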

Dramatically faster computers now allow designers to routinely simulate 
much more complicated analog circuits, with commensurately complicated noise 
spectra, possibly with multiple sharp peaks and nulls. In the approach in [57], 
the frequency-dependent output-to-noise conjugate transfer functions are com- 
puted by solving the transpose of the linearized circuit equation for each fre- 
quency of interest. In order to capture sharp features in the noise spectra, it 
would be necessary to solve at a large number of frequencies. The fifth paper in
our commemorative collection, Circuit noise evaluation by Padé approximation
based model-reduction techniques by Peter Feldmann and Roland Freund (see
page 451), cleverly applies their well-known Padé-via-Lanczos algorithm for
model reduction [18] to the problem of avoiding the multiple frequency solves
in noise analysis. In this approach, a rational function description for the noise
spectra is computed directly, and therefore sharp spectral features are easily captured.
Extensions of this method can be used for computing noise in the more complicated
time-varying cases described below [59].

Analyzing the circuit about the DC operating point is adequate for simple
circuits such as amplifiers and passive circuits such as filters, but cannot be 
used on circuits for which the noise performance is strongly affected by large 
signals, such as with mixers, oscillators, samplers, switched-capacitor filters, 
and the like. The large, generally periodic, signals present in these circuits act 
to modulate both the noise sources and the transfer characteristics of the circuit 
from the noise sources to the output. The result is cyclostationary noise, or 
noise with time-varying statistics, at the output of the circuit. With the recent 
rise of importance of RF circuits, a more general type of noise analysis became 
essential. What was needed was the ability to perform a noise analysis not about 
the DC operating point, but about a time-varying operating point. 

Okumura et al [48] extended Rohrer’s method to support prediction of noise 
of circuits linearized about a periodic operating point. These ideas were later 
commercialized by Telichevesky [66] and others. In this way the noise of cir- 
cuits such as mixers and switched-capacitor filters could be predicted accurately. 
However, there were still a great number of fundamental questions left unan- 
swered, particularly regarding oscillator phase noise. 

Time-Domain non-Monte Carlo Noise Simulations for Nonlinear Dynamic 
Circuits with Arbitrary Excitations was published at ICCAD in 1994 by Alper 
Demir, Edward Liu, and Alberto Sangiovanni-Vincentelli. It is the last physical 
simulation and analysis paper highlighted in this commemorative collection and 
can be found on page 413. In a break from EDA tradition, Demir et al proposed 
to perform noise analysis by formulating and solving the stochastic differential 
equations for the circuit [10, 11]. The approach offered a unique advantage in 
that it did not require a periodic operating point, or even that the circuit be in 
steady state. However, the method was also perceived as being complex and
was computationally expensive for large circuits. The added generality of the
method was not seen as a compelling advantage in light of the disadvantages, so
the method they proposed has not seen much use. Nevertheless, in these papers,
Demir et al were the first in the EDA field to advocate a more rigorous approach 
to noise analysis. They were also the first to introduce to the design community 
the theoretical foundation that would be needed to address the difficult questions 
of which they were just becoming aware. 

These papers were just the first of several by Demir that used stochastic dif- 
ferential equations to deeply explore questions of noise, particularly oscillator 
phase noise [12, 13, 14, 15]. This flurry of results led, either directly or indi- 
rectly, to papers by many others that together advanced the fundamental under- 
standing of noise and improved both the analysis [58, 9, 68, 51, 69] and design 
of low-noise circuits [36, 53].

7. Conclusions and Acknowledgments

The area of physical simulation and analysis has been enlivened in recent 
years by the growing importance of problems associated with RF design and 
interconnect effects. Since many of the key papers in these fields did not initially 
appear at ICCAD, and therefore were not considered for this commemorative 
collection, we hope that this commentary both celebrates the selected papers
and recognizes the importance of contributions that have appeared elsewhere. 

It is not clear what emerging problems will provide the stimulation to open 
up new directions in this area, but given how often the CAD community has 
incorrectly predicted that physical simulation research has plateaued, the authors 
would like to wait and see. 

We would like to thank Joel Phillips for his insightful technical comments 
while evolving this commentary, and for providing some first-hand historical in- 
formation. In addition, we would like to thank the many contributors to physical 
simulation and analysis both at ICCAD and elsewhere.

Abstract

Simulation in the frequency-domain avoids many of the severe problems experienced when trying 
to use traditional time-domain simulators such as Spice [1] to find the steady-state behavior 
of analog, RF, and microwave circuits. In particular, frequency-domain simulation eliminates 
problems from distributed components and high-Q circuits by forgoing a nonlinear differential 
equation representation of the circuit in favor of a complex algebraic representation. 

This paper describes the spectral Newton technique for performing simulation of nonlinear 
circuits in the frequency-domain, and its implementation in Harmonica. Also described are the
techniques used by Harmonica to exploit both the structure of the spectral Newton formulation 
and the characteristics of the circuits that would be typically seen by this type of simulator. These 
techniques allow Harmonica to be used on much larger circuits than were normally attempted 
by previous nonlinear frequency-domain simulators, making it suitable for use on Monolithic 
Microwave Integrated Circuits (MMICs). 



1. Introduction 

It is common for circuits designed to operate at RF and microwave frequen- 
cies to be pseudo-linear in nature. By this it is meant that input signals are 
sinusoidal and small enough so that few harmonics are produced. This does not 
imply that the nonlinearities in the circuit can be neglected. Indeed, mixers and 
oscillators fit this description and yet they fundamentally depend on nonlinear 
effects to operate. It is also common for these circuits to have a large number of
distributed components such as transmission lines, whose models often include 
loss, dispersion, and coupling effects. These distributed components are very 
difficult and often impractical to simulate in the time-domain because the partial 
differential equations that describe these structures often do not have closed- 
form solutions. In addition, time-domain simulators are not able to exploit the 
pseudo-linear nature of these circuits, and often require an excessive amount of 
time because the steady-state solution is desired. Using a time-domain simulator 
to find the steady-state solution requires that the circuit be simulated until the 
transient solution vanishes, resulting in a very expensive simulation when the 
circuit is high-Q or narrow-band.

Simulating these circuits in the frequency-domain avoids these problems and
eases the problem of formulating the equations for distributed components by
transforming the time-domain differential equations into algebraic complex equations.
The pseudo-linear nature of these circuits is naturally exploited since the
amount of CPU time required for a frequency-domain simulation is proportional
to the number of frequencies present. Another method has been proposed to 
find the steady-state solution [2, 3]. The shooting method, as it is often called, 
iteratively solves the circuit in the time-domain for one period; on each iteration 
the initial condition is varied, attempting to make the signals at the end of the 
period exactly match those at the beginning. The shooting method does work on 
autonomous circuits, but does not help with distributed components and is not 
capable of finding almost-periodic solutions. 
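
The shooting iteration just described can be sketched for a one-node circuit. The example is illustrative only and is not taken from [2, 3]: it assumes a sinusoidal current source driving a capacitor in parallel with a cubic conductance, a fixed-step fourth-order Runge-Kutta integrator, and a Newton update on the initial condition. The function names and parameter values are hypothetical.

```python
import math


def integrate_period(v0, g, a, cap, amp, omega, steps=2000):
    """Integrate C dv/dt = amp*cos(omega t) - g v - a v^3 over one period.

    Returns v(T) and the sensitivity dv(T)/dv0, obtained by integrating the
    linearized (variational) equation alongside the circuit equation.
    """
    def rhs(t, v, s):
        dv = (amp * math.cos(omega * t) - g * v - a * v ** 3) / cap
        ds = -(g + 3.0 * a * v * v) * s / cap
        return dv, ds

    h = 2.0 * math.pi / omega / steps
    v, s = v0, 1.0
    for k in range(steps):
        t = k * h
        k1 = rhs(t, v, s)
        k2 = rhs(t + h / 2, v + h / 2 * k1[0], s + h / 2 * k1[1])
        k3 = rhs(t + h / 2, v + h / 2 * k2[0], s + h / 2 * k2[1])
        k4 = rhs(t + h, v + h * k3[0], s + h * k3[1])
        v += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        s += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return v, s


def shoot(g, a, cap, amp, omega, tol=1e-10, max_iter=20):
    """Newton iteration on the initial condition so that v(T) = v(0)."""
    v0 = 0.0
    for _ in range(max_iter):
        v_end, sens = integrate_period(v0, g, a, cap, amp, omega)
        mismatch = v_end - v0
        if abs(mismatch) < tol:
            break
        v0 -= mismatch / (sens - 1.0)
    return v0
```

With the cubic term set to zero the circuit is linear, and the computed initial condition can be checked against the phasor solution amp g / (g^2 + ω^2 C^2). Each Newton step costs one transient simulation over a single period, no matter how slowly the natural transient would decay.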

Previous efforts at nonlinear frequency-domain simulation were based on the 
use of harmonic balance to formulate the frequency-domain equations and an 
optimizer to solve them [4, 5, 6]. Using an optimizer to solve these equations re- 
sults in the number of harmonics and nonlinear devices being severely limited. 
It is possible to remove this limit by instead solving the nonlinear equations 
with Newton’s method [7]. When this is done the circuit equations can be re- 
formulated in a more natural way, and by doing so the name harmonic balance 
becomes somewhat of a misnomer. So the more appropriate name spectral New- 
ton was coined. 
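
To make the idea of solving the balance equations with Newton's method concrete, the sketch below treats a one-node circuit with a cubic conductance and a linear capacitor. It is not Harmonica's formulation, and several choices are assumptions made for brevity: the 2H+1 unknowns are stored as equally spaced time samples, which are related to the H harmonics by a discrete Fourier transform; the capacitor current is evaluated with a spectral differentiation matrix that is exact for signals containing at most H harmonics; and the Jacobian is dense. All names and values are hypothetical.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def harmonic_balance(g, a, cap, amp, omega, harmonics, tol=1e-12, max_iter=50):
    """Periodic steady state of g v + a v^3 + C dv/dt = amp*cos(omega t).

    Returns the solution at n = 2*harmonics + 1 equally spaced points of
    one period.
    """
    n = 2 * harmonics + 1

    def deriv(j, k):
        # Entry of the spectral d/dt matrix for an odd number of samples.
        if j == k:
            return 0.0
        sign = -1.0 if (j - k) % 2 else 1.0
        return 0.5 * omega * sign / math.sin(math.pi * (j - k) / n)

    d = [[deriv(j, k) for k in range(n)] for j in range(n)]
    src = [amp * math.cos(2.0 * math.pi * j / n) for j in range(n)]
    v = [0.0] * n
    for _ in range(max_iter):
        # Kirchhoff current law residual at every sample point.
        resid = [g * v[j] + a * v[j] ** 3
                 + cap * sum(d[j][k] * v[k] for k in range(n)) - src[j]
                 for j in range(n)]
        if max(abs(r) for r in resid) < tol:
            break
        # Jacobian: device conductances on the diagonal plus C * d/dt.
        jac = [[cap * d[j][k] + (g + 3.0 * a * v[j] ** 2 if j == k else 0.0)
                for k in range(n)] for j in range(n)]
        step = solve(jac, [-r for r in resid])
        v = [vj + sj for vj, sj in zip(v, step)]
    return v
```

For a linear circuit (a = 0) Newton converges in one step and the samples match the phasor solution. For the nonlinear case the iteration takes a few steps, and the cost is governed by the number of harmonics retained rather than by the length of any transient.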

2. Spectral Newton 

In order to apply the spectral Newton method, two conditions must be satis- 
fied. First, the circuit must be asymptotically stable and must have a steady-state 
solution for the given excitation; chaotic and subharmonic behavior is specifi- 
cally excluded. Second, all nonlinear devices must be lumped and their consti- 
tutive relationship must be algebraic, differentiable, and expressible in one of 
the following forms: 

$$i = i(v) \qquad q = q(v) \qquad i = i(\phi) \qquad q = q(\phi)$$

$$v = v(i) \qquad v = v(q) \qquad \phi = \phi(i) \qquad \phi = \phi(q)$$

Though not necessary, we will assume that the circuit has a periodic solution. 
Extension of these results to almost-periodic solutions is straight-forward [7]. 
For simplicity, we will further assume that a nodal formulation is being used and 
that only voltage controlled resistive and capacitive nonlinearities are allowed. 

In the time-domain a circuit can be modeled as a system of $N$ nonlinear differential equations, here written in compact form as 

\[ f(v,t) = i_s(t), \qquad v(0) = v_0 \tag{1} \]

Let $U = \{\psi \mid \psi : \mathbb{R} \to \mathbb{R}^N\}$. Then $v \in U$ is the vector of unknown node voltage waveforms; $v_0 \in \mathbb{R}^N$ is the unknown initial condition that results in the solution being periodic, i.e. $v(t) = v(t + T_0)\ \forall\, t$; $i_s \in U$ is the vector of source current waveforms; and $f : U \to U$. In order to solve this system it is traditional to discretize it in time and apply some numeric integration method. However if 
only the steady-state response is of interest, it is possible to transform this sys- 
tem into the frequency-domain and solve it without resorting to numeric integra- 
tion. To solve the system in the frequency-domain, it is necessary to truncate the 
number of harmonics considered to a finite, and in general small, H. The trun- 
cation is analogous to discretization in the time-domain and is theoretically not 
a limitation because for all realizable circuits there exists a frequency beyond 
which there is negligible power. 

Since the nonlinear devices are lumped, $f(v,t)$ can be rewritten as 

\[ f(v,t) = i(v(t)) + \frac{d}{dt}\, q(v(t)) + \int_{-\infty}^{t} y(t-\tau)\, v(\tau)\, d\tau \tag{2} \]

where $i, q : \mathbb{R}^N \to \mathbb{R}^N$ are differentiable functions representing respectively the sum of the currents exiting the nodes due to the nonlinear conductors and the sum of the charge exiting the nodes due to the nonlinear capacitors; and $y(t) \in \mathbb{R}^{N \times N}$ is the impulse response of the circuit with the nonlinear devices turned off.² Since $y$ is linear, the Laplace transform may be used to transform it into the frequency-domain, $y(t) \leftrightarrow Y(s)$. Furthermore, since $v$ is periodic and the circuit is stable 

\[ \int_{-\infty}^{t} y(t-\tau)\, v(\tau)\, d\tau \;\leftrightarrow\; YV \]

where $v \leftrightarrow V \in \mathbb{C}^{NH}$ contains the node voltage phasor for each node and each frequency, and $Y \in \mathbb{C}^{NH \times NH}$ is a block node admittance matrix for the linear portion of the circuit. 



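\[ Y = [Y_{mn}], \qquad m,n \in \{1,2,\ldots,N\} \]

\[ Y_{mn} = [Y_{mn}(k\omega_0, l\omega_0)], \qquad k,l \in \{0,1,\ldots,H-1\} \]

\[ Y_{mn}(k\omega_0, l\omega_0) = \begin{cases} Y_{mn}(jk\omega_0) & k = l \\ 0 & k \ne l \end{cases} \]

where $m,n$ are the node indices; $k,l$ are the frequency indices, and $j = \sqrt{-1}$. 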

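Since $v$, $i$ and $q$ are periodic, (1) and (2) can be transformed into the frequency domain by applying the Fourier series. 

\[ F(V) = I(V) + j\Omega\, Q(V) + YV = I_s \tag{3} \]

where $i_s \leftrightarrow I_s \in \mathbb{C}^{NH}$ contains the source current phasor for each node and frequency; $f \leftrightarrow F$, $i \leftrightarrow I$, $q \leftrightarrow Q : \mathbb{C}^{NH} \to \mathbb{C}^{NH}$; and $\Omega \in \mathbb{R}^{NH \times NH}$ is given by 

\[ \Omega = [\Omega_{mn}], \qquad m,n \in \{1,2,\ldots,N\} \]

\[ \Omega_{mn} = \begin{cases} 0 & m \ne n \\ \operatorname{diag}\{0, \omega_0, 2\omega_0, \ldots, (H-1)\omega_0\} & m = n \end{cases} \]

and $\omega_0 = 2\pi/T_0$. 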

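The Newton-Raphson method is used to solve (3) for $V$, which requires that $F(V)$ be differentiated with respect to $V$. However, since $f(v,t)$ and $v(t)$ are constrained to be real functions, $F(V)$ is non-analytic, which implies that its derivative $J(V)$ cannot be represented using complex numbers. To circumvent this problem each complex number is written as an equivalent vector in $\mathbb{R}^2$. To perform this conversion, some more notation will be defined. Let $X \in \mathbb{C}$. Then define $X^R, X^I \in \mathbb{R}$ and $\bar X \in \mathbb{R}^2$ such that $X^R = \mathrm{Re}\{X\}$, $X^I = \mathrm{Im}\{X\}$, and $\bar X = [X^R\ X^I]^T$. Similar notation is used for vectors and matrices. Using this notation, $\bar F(\bar V), \bar V \in \mathbb{R}^{2NH}$ and (3) is solved with the iteration 

\[ \bar V^{(k+1)} = \bar V^{(k)} - J(\bar V^{(k)})^{-1} \left[ \bar F(\bar V^{(k)}) - \bar I_s \right] \tag{4} \]

where $J \in \mathbb{R}^{2NH \times 2NH}$ is the spectral Jacobian, i.e. 

\[ J(\bar V) = \frac{\partial \bar F(\bar V)}{\partial \bar V} = \frac{\partial \bar I(\bar V)}{\partial \bar V} + \overline{j\Omega}\, \frac{\partial \bar Q(\bar V)}{\partial \bar V} + \bar Y \]

with $j\Omega$ and $Y$ written in their real-equivalent form. If $\bar V^{(0)}$ is chosen close enough to a solution, then given certain mild conditions on (3), the sequence converges to that solution [9]. 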

The only impediment in evaluating this expression is finding the contribution 
of the nonlinear elements to $F(V)$ and $J(V)$ because it is extremely difficult 
to formulate the nonlinear device equations directly in the frequency-domain. 
To avoid this problem, the node voltages are transformed into the time-domain 
and applied to the nonlinear devices. The response current of these devices 
is then calculated and converted back into the frequency-domain and added to 
$F(V)$. Calculation of $J(V)$ is similar except that the node voltage waveforms 
are applied to the devices’ derivative equations and the resulting waveforms are 
converted into the frequency-domain and added to $J(V)$. The calculation of the 
spectral Jacobian will be covered in more detail in the next section. Since the 
signals are assumed to be periodic, the Fast Fourier Transform (FFT) may be 
used to perform the transformations between the frequency- and time-domains. 
If the periodic signal restriction is loosened to allow almost-periodic signals, 
then the Discrete Fourier Transform (DFT) should be used. 

Spectral Newton Algorithm 

Given: Initial guess of node voltage spectra taken from DC and small-signal AC analysis of circuit. 

Step 1: Convert node voltage spectra into time-domain. 

Step 2: Evaluate nonlinear devices for output current and derivative waveforms. 

Step 3: Convert the waveforms into the frequency-domain. 

Step 4: Build and solve the spectral Newton update equation (4). 

Step 5: Check $F(V)$ and $\Delta V$ for convergence; if not converged, go to step 1. 
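
The five steps can be sketched for a one-node circuit as follows. This is an illustrative sketch under several assumptions that are not in the text: the circuit is a current source `amp*cos(w0*t)` driving a linear conductance `g`, a linear capacitor `c`, and one nonlinear conductor `i_nl(v)`; the unknowns are real cosine and sine coefficients rather than the scaled phasors used above; and an explicit transform matrix replaces the FFT so that the Jacobian is easy to write down. The function name `spectral_newton` is hypothetical.

```python
import numpy as np

def spectral_newton(i_nl, di_nl, g, c, amp, w0, H=8, tol=1e-10, max_iter=50):
    """Solve i_nl(v) + g*v + c*dv/dt = amp*cos(w0*t) for the periodic v(t).

    Unknowns: x = [a0, a1, b1, ..., a_{H-1}, b_{H-1}] with
    v(t) = a0 + sum_k (a_k cos(k w0 t) + b_k sin(k w0 t)).
    """
    n = 2 * H - 1                                  # real unknowns and time samples
    t = np.arange(n) * (2 * np.pi / w0) / n        # one period, uniformly sampled
    to_time = np.ones((n, n))                      # coefficients -> time samples
    for k in range(1, H):
        to_time[:, 2 * k - 1] = np.cos(k * w0 * t)
        to_time[:, 2 * k] = np.sin(k * w0 * t)
    to_freq = np.linalg.inv(to_time)               # time samples -> coefficients
    # Linear part: real 2x2 blocks equivalent to Y(jk w0) = g + j k w0 c.
    y_lin = np.zeros((n, n))
    y_lin[0, 0] = g
    for k in range(1, H):
        b = k * w0 * c
        y_lin[2 * k - 1:2 * k + 1, 2 * k - 1:2 * k + 1] = [[g, b], [-b, g]]
    i_src = np.zeros(n)
    i_src[1] = amp                                 # cosine term at the fundamental
    x = np.linalg.solve(y_lin, i_src)              # initial guess: linear solution
    for _ in range(max_iter):
        v = to_time @ x                                        # Step 1
        f = to_freq @ i_nl(v) + y_lin @ x - i_src              # Steps 2 and 3
        jac = to_freq @ (di_nl(v)[:, None] * to_time) + y_lin  # Steps 2 and 3
        dx = np.linalg.solve(jac, f)                           # Step 4
        x = x - dx
        if np.linalg.norm(f) < tol and np.linalg.norm(dx) < tol:  # Step 5
            return x, True
    return x, False

# Assumed example: cubic conductor i = 0.5 v^3 in parallel with g = 1 S and c = 0.1 F.
x, converged = spectral_newton(lambda v: 0.5 * v**3, lambda v: 1.5 * v**2,
                               g=1.0, c=0.1, amp=1.0, w0=2 * np.pi)
```

With the nonlinear conductor removed, the routine reduces to the linear phasor solution, which gives a simple sanity check on the real-valued blocks.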
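3. Spectral Jacobian 

In our method, the spectral Jacobian is organized as the block matrix 

\[ J(\bar V) = \left[ \frac{\partial \bar F_m(V)}{\partial \bar V_n} \right], \qquad m,n \in \{1,2,\ldots,N\} \tag{5} \]

where $\bar F_m, \bar V_n \in \mathbb{R}^{2H}$ are vectors of phasors, one phasor for each frequency. $\bar F_m$ equals the sum of currents exiting node $m$ and $\bar V_n$ equals the node voltage of node $n$. This block matrix is referred to as the block node admittance matrix because its structure is identical to the node admittance matrix. The blocks have the form 

\[ \frac{\partial \bar F_m}{\partial \bar V_n} = \left[ \frac{\partial \bar F_m(V, k\omega_0)}{\partial \bar V_n(l\omega_0)} \right], \qquad k,l \in \{0,1,\ldots,H-1\} \tag{6} \]

where $\bar F_m(V, k\omega_0) \in \mathbb{R}^2$ is the $k$th harmonic of $F_m$ and $\bar V_n(l\omega_0) \in \mathbb{R}^2$ is the $l$th harmonic of $V_n$. 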



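\[ \frac{\partial \bar F_m(V, k\omega_0)}{\partial \bar V_n(l\omega_0)} = \begin{bmatrix} \dfrac{\partial F_m^R(V, k\omega_0)}{\partial V_n^R(l\omega_0)} & \dfrac{\partial F_m^R(V, k\omega_0)}{\partial V_n^I(l\omega_0)} \\[2ex] \dfrac{\partial F_m^I(V, k\omega_0)}{\partial V_n^R(l\omega_0)} & \dfrac{\partial F_m^I(V, k\omega_0)}{\partial V_n^I(l\omega_0)} \end{bmatrix} \]

This derivative consists of the sum of terms 

\[ \frac{\partial \bar F_m(V, k\omega_0)}{\partial \bar V_n(l\omega_0)} = \frac{\partial \bar I_m(V, k\omega_0)}{\partial \bar V_n(l\omega_0)} + \begin{bmatrix} 0 & -k\omega_0 \\ k\omega_0 & 0 \end{bmatrix} \frac{\partial \bar Q_m(V, k\omega_0)}{\partial \bar V_n(l\omega_0)} + \bar Y_{mn}(k\omega_0, l\omega_0) \tag{7} \]

\[ \bar Y_{mn}(k\omega_0, l\omega_0) = \begin{bmatrix} Y_{mn}^R(k\omega_0, l\omega_0) & -Y_{mn}^I(k\omega_0, l\omega_0) \\ Y_{mn}^I(k\omega_0, l\omega_0) & Y_{mn}^R(k\omega_0, l\omega_0) \end{bmatrix} \tag{8} \]

Only the calculation of $\partial I_m^R(V, k\omega_0)/\partial V_n^R(l\omega_0)$ will be performed; the calculation of the other terms in $\partial \bar I_m(V, k\omega_0)/\partial \bar V_n(l\omega_0)$ and $\partial \bar Q_m(V, k\omega_0)/\partial \bar V_n(l\omega_0)$ is similar. 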



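\[ I_m(V, k\omega_0) = \frac{1}{T_0} \int_0^{T_0} i_m(v(t))\, e^{-jk\omega_0 t}\, dt \]

\[ I_m^R(V, k\omega_0) = \frac{1}{T_0} \int_0^{T_0} i_m(v(t)) \cos(k\omega_0 t)\, dt \]

The function $v$ is considered implicitly to be a function of its frequency-domain equivalent, $V$; so the chain rule can be employed to calculate the derivative. 

\[ \frac{\partial I_m^R(V, k\omega_0)}{\partial V_n^R(l\omega_0)} = \frac{1}{T_0} \int_0^{T_0} \frac{\partial i_m(v(t))}{\partial v_n(t)}\, \frac{\partial v_n(t)}{\partial V_n^R(l\omega_0)} \cos(k\omega_0 t)\, dt \]

Now the derivative of $v_n(t)$ is calculated. 

\[ v_n(t) = \sum_{k=-\infty}^{\infty} V_n(k\omega_0)\, e^{jk\omega_0 t} \]

\[ v_n(t) = V_n^R(0) + 2 \sum_{k=1}^{\infty} \left[ V_n^R(k\omega_0) \cos(k\omega_0 t) - V_n^I(k\omega_0) \sin(k\omega_0 t) \right] \]

For $l = 0$, the derivative is trivial; for $l \ne 0$ 

\[ \frac{\partial v_n(t)}{\partial V_n^R(l\omega_0)} = 2\cos(l\omega_0 t), \qquad \frac{\partial v_n(t)}{\partial V_n^I(l\omega_0)} = -2\sin(l\omega_0 t) \]

So if $l \ne 0$ 

\[ \frac{\partial I_m^R(V, k\omega_0)}{\partial V_n^R(l\omega_0)} = \frac{2}{T_0} \int_0^{T_0} \frac{\partial i_m(v(t))}{\partial v_n(t)} \cos(l\omega_0 t) \cos(k\omega_0 t)\, dt = \frac{1}{T_0} \int_0^{T_0} \frac{\partial i_m(v(t))}{\partial v_n(t)} \left[ \cos((k+l)\omega_0 t) + \cos((k-l)\omega_0 t) \right] dt \]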



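Now let $G_{mn}(k\omega_0) \in \mathbb{C}$ be the $k$th harmonic of the derivative waveform, i.e., let 

\[ G_{mn}(k\omega_0) = \frac{1}{T_0} \int_0^{T_0} \frac{\partial i_m(v(t))}{\partial v_n(t)}\, e^{-jk\omega_0 t}\, dt \tag{9} \]

Then for $l = 0$ 

\[ \frac{\partial \bar I_m(V, k\omega_0)}{\partial \bar V_n(0)} = \begin{bmatrix} G_{mn}^R(k\omega_0) & 0 \\ G_{mn}^I(k\omega_0) & 0 \end{bmatrix} \tag{10} \]

and for $l \ne 0$ 

\[ \frac{\partial \bar I_m(V, k\omega_0)}{\partial \bar V_n(l\omega_0)} = \begin{bmatrix} G_{mn}^R((k+l)\omega_0) + G_{mn}^R((k-l)\omega_0) & G_{mn}^I((k+l)\omega_0) - G_{mn}^I((k-l)\omega_0) \\ G_{mn}^I((k+l)\omega_0) + G_{mn}^I((k-l)\omega_0) & G_{mn}^R((k-l)\omega_0) - G_{mn}^R((k+l)\omega_0) \end{bmatrix} \tag{11} \]

This completes the calculation of the spectral Jacobian. It may now be synthesized from (5), (6), (7), (8), (9), (10) and (11). 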
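
The relation between the derivative spectrum and the 2x2 Jacobian blocks can be checked numerically. The sketch below assumes the sign convention $e^{-jk\omega_0 t}$ for both the current harmonics and $G_{mn}$, an arbitrary band-limited test waveform `g` standing in for the derivative waveform, and hypothetical function names; it compares the blocks built from the spectrum with direct numerical integration of the chain-rule integrals.

```python
import numpy as np

def block_from_spectrum(G, k, l):
    """2x2 block dI(k w0)/dV(l w0) built from the derivative spectrum G (FFT order)."""
    if l == 0:
        return np.array([[G[k].real, 0.0], [G[k].imag, 0.0]])
    gp, gm = G[k + l], G[k - l]      # a negative index wraps to the negative harmonic
    return np.array([[gp.real + gm.real, gp.imag - gm.imag],
                     [gp.imag + gm.imag, gm.real - gp.real]])

def block_by_quadrature(g, t, w0, k, l):
    """Same block from the chain-rule integrals, averaged over one period."""
    weights = (np.cos(k * w0 * t), -np.sin(k * w0 * t))   # real, imaginary part of I(k w0)
    if l == 0:
        dv = (np.ones_like(t), np.zeros_like(t))          # dv/dV^R(0), dv/dV^I(0)
    else:
        dv = (2 * np.cos(l * w0 * t), -2 * np.sin(l * w0 * t))
    return np.array([[np.mean(g * d * w) for d in dv] for w in weights])

S, w0 = 64, 2 * np.pi
t = np.arange(S) / S                                      # one period, T0 = 1
g = 1.0 + 0.5 * np.cos(w0 * t) + 0.3 * np.sin(2 * w0 * t)
G = np.fft.fft(g) / S                                     # harmonics of g
max_err = max(np.abs(block_from_spectrum(G, k, l)
                     - block_by_quadrature(g, t, w0, k, l)).max()
              for k in range(5) for l in range(5))
```

Because the test waveform and all products are band-limited well below the sampling rate, the two computations should agree to rounding error.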



4. Harmonica 

We are currently developing a simulator based on the spectral Newton al- 
gorithm. Unlike previous efforts [4, 7], which were aimed at circuits contain- 
ing only one or two nonlinear devices, Harmonica is designed to quickly ana- 
lyze large circuits with many nonlinear devices. This advance is made possible 
by using spectral Newton, by exploiting the structure and characteristics of the 
spectral Jacobian, and by exploiting the linear and almost-linear behavior of the 
devices. 

The spectral Jacobian is quite large and moderately dense, having about 4H 
elements per row or column. Naively applying sparse matrix techniques is not 
enough to solve the Newton update equation (4) efficiently. It is necessary to 
make some judicious approximations when constructing and decomposing the 
Jacobian to reduce the density of the matrix. The Jacobian is only used to gen- 
erate new iterates, and is not used when confirming convergence, so errors from 
approximations in the Jacobian only affect the rate and region of convergence, 
not the accuracy of the final solution. An approximate spectral Jacobian results 
in the loss of quadratic convergence, but the gain in efficiency more than makes 
up for this loss. 

In a node admittance matrix, any particular element is the sum of contribu- 
tions from zero or more devices. This is also true for the block node admittance 
matrix generated by the spectral Newton algorithm. In the block node admit- 
tance matrix, contributions from linear devices come as diagonal blocks, i.e. 
only the diagonal 2x2 sub-blocks are nonzero. Nonlinear devices contribute 
full blocks, however if the device is behaving almost-linearly the elements on 
the diagonal of the block are the largest and as the distance from the diagonal 
increases their magnitude decreases rapidly. This results from (10) and (11), and 
from the bandwidth of the derivative spectrum (9) being small if the device is 
behaving almost-linearly. 

The effort required to LU decompose the spectral Jacobian can be signifi- 
cantly reduced if two approximations are made. First, in those blocks that have 
contributions only from elements behaving linearly or almost-linearly, the small 
elements far from the diagonal should be set to zero and the operations that 
would normally be performed on these elements should be avoided. The deci- 
sion of which elements are small enough to ignore can be made by comparing 
the magnitude of the upper harmonics of the derivative spectrum to some small 
fraction of the DC component. The value $10^{-4}\, G_{mn}(0)$ seems to work well. Of 
those harmonics smaller than the cutoff criterion, only the first should be kept: 
all others should be set to zero. This last nonzero harmonic is called the guard 
harmonic. Second, all nonzero fill-ins that result during LU decomposition from 
operations involving the guard harmonic should be ignored. This prevents the 
bandwidth of the blocks from growing unnecessarily during the decomposition. 
These two approximations allow Harmonica to exploit linear and almost-linear 
behavior in the circuit. To get the most from them, pivoting of the block node 
admittance matrix should be done with the additional goal of exploiting the re- 
duced bandwidth of the blocks. 
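
The first approximation, truncation with a guard harmonic, might look like the following sketch. It assumes the derivative spectrum is given as a list indexed by harmonic number and that everything above the highest harmonic exceeding the cutoff counts as small; the function name and default fraction are illustrative only.

```python
def truncate_with_guard(spectrum, fraction=1e-4):
    """Zero the small upper harmonics, keeping one guard harmonic below the cutoff."""
    cutoff = fraction * abs(spectrum[0])
    # Highest harmonic whose magnitude still meets the cutoff (DC always counts).
    last_big = max(k for k, g in enumerate(spectrum) if k == 0 or abs(g) >= cutoff)
    guard = min(last_big + 1, len(spectrum) - 1)
    return [g if k <= guard else 0.0 for k, g in enumerate(spectrum)]

trimmed = truncate_with_guard([1.0, 0.1, 1e-3, 1e-5, 1e-6, 1e-7])
```

In this example harmonic 3 is the guard harmonic, so harmonics 4 and 5 are set to zero.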

The last technique used to accelerate the spectral Newton iteration is to only 
occasionally reevaluate the spectral Jacobian [9]. This works well if the Jacobian 
is not changing much between iterations. It can greatly reduce the time 
required for an iteration because device evaluations and forward- and backward- 
elimination of the LU decomposed Jacobian are much faster than the decompo- 
sition of the spectral Jacobian. 
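
Reusing a Jacobian across iterations can be illustrated on a small generic system. The sketch below is not Harmonica's implementation: it assumes a two-variable test problem, refreshes the Jacobian every `refresh` iterations, and stores an explicit inverse where a real simulator would keep the LU factors. All names are hypothetical.

```python
import numpy as np

def newton_with_reuse(F, J, x0, refresh=3, tol=1e-12, max_iter=100):
    """Newton iteration that re-evaluates the Jacobian only every `refresh` steps."""
    x = np.asarray(x0, dtype=float)
    j_inv, evaluations = None, 0
    for it in range(max_iter):
        f = F(x)
        if np.linalg.norm(f) < tol:
            return x, evaluations
        if it % refresh == 0:
            j_inv = np.linalg.inv(J(x))   # stands in for a stored LU decomposition
            evaluations += 1
        x = x - j_inv @ f                 # cheap step with the stored Jacobian
    raise RuntimeError("no convergence")

def F(x):
    return np.array([x[0]**2 + x[1]**2 - 4.0, x[0] - x[1]])

def J(x):
    return np.array([[2 * x[0], 2 * x[1]], [1.0, -1.0]])

root, evaluations = newton_with_reuse(F, J, [1.0, 1.0])
```

The steps taken with a stale Jacobian converge only linearly, which mirrors the loss of quadratic convergence noted earlier, but each such step avoids a new decomposition.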

Harmonica is written in the C programming language. 

5. Results 

Execution times for Harmonica are a strong function of the number of har- 
monics simulated, the strength of the nonlinear behavior, and the number of de- 
vices behaving nonlinearly. Before applying the techniques given in the previous 
section each iteration requires 0{N^'^H^) operations. After applying those tech- 
niques, and measuring the execution times of only a few circuits, each iteration 
seems to require 0{N^'^H) operations. The iteration count remains relatively 
constant as the number of harmonics changes. 

The times for three circuits are presented in Table 1. The first two circuits 
are well-suited to simulation in the frequency-domain and poorly suited to time- 
domain simulation. With the last circuit, the roles are reversed. The first is a 
traveling wave amplifier (TWA) [10] that contains four bipolar transistors and 
ten transmission lines of noncommensurate length. Note that the transmission 
lines are constrained to be ideal by Spice; Harmonica easily handles lossy and 
dispersive lines. The second circuit contains a differential pair and a crystal 
lattice filter. This circuit demonstrates the ease with which Harmonica handles 
high-Q circuits. 

The last circuit, a simple noninverting amplifier containing a µA741, is troublesome to Harmonica because the op amp is internally acting strongly nonlinear: the large load causes the output stage to operate class B. This example demonstrates that Harmonica is able to handle strongly nonlinear circuits, though it may run longer than traditional simulators. 

Since Harmonica is solving an algebraic system of equations, if sufficient 
harmonics are computed, it can be much more accurate than a time-domain 
simulator. This is demonstrated in all the test circuits: when Harmonica was 
able to converge with only eight harmonics computed, the maximum error in any 
harmonic was less than 1 ppm. Furthermore, the worst case error resulting from 
harmonics not computed was less than 20 ppm. These numbers were greatly 
reduced when more than eight harmonics were computed. Spice2 computes with 
a 1000 ppm error tolerance. 

6. Conclusions 

The spectral Newton method for frequency-domain simulation of nonlinear circuits was described along with techniques used by Harmonica to increase the efficiency of the method. This method allows circuits that are behaving quasi-linearly to be quickly simulated, even though they may be very high-Q or contain many transmission lines. 

Work is being done to at least double the speed of the simulator by further exploiting the structure of the spectral Jacobian. 

Table 1. Simulation times for Spice2 and Harmonica for various circuits. Times are given in seconds and were measured on a VAX 11/785 running UNIX 4.3BSD. 

Columns: Circuit; Conditions; Spice2; Harmonica with 8, 16, and 32 harmonics. 

TWA; Vout = 1 V; 22; 56 
TWA; Vout = 0.5 V; 16; 40 
Filter; 2350; 20; 94 
µA741; Vout = 1 V, RL = ∞ Ω; 9; 13; 29 
µA741; Vout = 1 V, RL = 10 kΩ; 13; 10; 28; 63 
µA741; Vout = iy, Rl = 10 KSl; 14; NA; 365; 575 
Notes 

1 . In function space, no one can hear you scream [8] . 

2. To turn a nonlinear device off, simply replace its constitutive equation y = f{x) with y = 0. 

3. This number is an extrapolation made from measurements of times required for smaller simulation 
intervals. The desired time interval (two periods) causes memory usage to exceed UNIX’s 16 MByte limit. 

4. This time was not measured. 

5. Circuit was behaving too nonlinearly for Harmonica to converge with so few harmonics. 
Abstract 

In recent years, on-chip interconnect has had an increasingly important impact on overall system 
performance. Much work has been done to develop algorithms which can efficiently and accu- 
rately predict delay through on-chip interconnect. These algorithms compute a reduced-order 
approximation (usually based on the “Elmore delay”) for voltage-transfer ratios (from source to 
loads) in an RC-tree model for the interconnect. However, much less emphasis has been placed 
on accurately approximating the driving-point characteristic at the root of an RC-tree. A good 
driving-point approximation is needed to accurately predict how delay through a gate is influ- 
enced by the interconnect which that gate must drive. Macromodels for on-chip gates typically 
consider only total capacitance of the driven interconnect, completely ignoring series resistance. 
In this paper, we present an efficient algorithm which accounts for series resistance by computing 
a reduced-order approximation for the driving-point admittance of an RC-tree. Using an ECL 
clock buffer as an example, we demonstrate a significant improvement in accuracy. 



1. Introduction 

1.1 Interconnect Delay 

“Interconnect delay” has two distinct components, which can best be illus- 
trated by considering a simplified net with no branching and only one load gate 
(see Fig. 1). Let $T_{AB}(0)$ represent delay through an unloaded source gate. Let 
$T_{AB}(L)$ represent delay through the same source gate loaded by an interconnect 
net of length $L$. Because of loading effects on the source gate, $T_{AB}(L)$ exceeds 
$T_{AB}(0)$. We define “interconnect delay” as follows: 

\[ T_{INT} = T_{AC} - T_{AB}(0) = [T_{AB}(L) - T_{AB}(0)] + T_{BC}. \tag{1} \]
SOURCE    INTERCONNECT    LOAD 










Figure 1. Simplified net (for illustrating components of 



The two components are extra source gate delay ($T_{AB}(L) - T_{AB}(0)$), which is the 
focus of this paper; and propagation delay ($T_{BC}$), which has been extensively 
analyzed [2]-[4]. 

1.2 Delay Modeling 

On-chip interconnect is well modeled by RC-trees [1]. In the last several 
years, research has been active on fast algorithms for on-chip delay estimation 
and bounding using RC-tree models [2]-[6]. These algorithms have proven to 
be useful alternatives to “exact” numerical simulation (e.g., SPICE [7]), where 
computation time becomes quite large even for relatively small circuits. However, 
regarding interconnect delay, the cited works [2]-[6] concentrate mainly 
on the propagation component ($T_{BC}$). A reduced-order approximation, based on 
the “Elmore delay” [8], is computed for the voltage-transfer ratio ($V_C(s)/V_B(s)$) 
from the source gate output to a given load gate input. See Section 2 for a brief 
review of these voltage-transfer ratio approximations. 

In this paper, we concentrate on the extra source gate component ($T_{AB}(L) - T_{AB}(0)$) 
of $T_{INT}$. We compute a reduced-order approximation for the driving-point 
admittance ($I_B(s)/V_B(s)$) seen from the output of the source gate. This 
approximation for the driving-point admittance of interconnect is independent 
of any modeling approximations made for the non-linear behavior of the source 
gate. Our approximation accounts for distributed series resistance present in the 
interconnect. Earlier works take one of the following approaches: 

1. The source gate is modeled simply with a linear output resistance. Delay 
through the loaded source gate ($T_{AB}(L)$) is approximated by computing the 
Elmore delay to point B [2]-[4]. However, this is just the model source 
gate output resistance times the total load net capacitance to ground. Se- 
ries resistance present in the interconnect does not influence this delay 
calculation. 
2. The source gate is modeled more accurately with a non-linear macromodel. 
However, the source gate output waveform ($v_B(t)$) is approximated 
with a macromodel response which is influenced by the driven net 
only through the net’s total capacitance (again, series interconnect resis- 
tance is not considered) [5], [6]. 



2. Voltage-Transfer Ratio Approximations 



In this section, we review two commonly used approximations for a voltage- 
transfer ratio in an RC-tree. The approximation which is most widely used today 
is based on the early work of Elmore [8] and was first developed for RC-tree 
models of on-chip interconnect in [2]. In situations demanding greater accu- 
racy, a higher-order extension developed by Horowitz [5] has proven useful. 
Both approximations are influenced by the presence and distribution of series 
interconnect resistance. 

Let $h(t)$, $h_{ELM}(t)$, and $h_{HOR}(t)$ denote respectively: the exact, the Elmore 
approximate, and the Horowitz approximate unit voltage impulse response at a 
given load in an RC-tree. Let $H(s)$, $H_{ELM}(s)$, and $H_{HOR}(s)$ denote the respective 
Laplace transforms of these impulse responses at the same load. The Elmore 
approximate transfer function has a single pole: 



\[ H_{ELM}(s) = \frac{1}{1 + sT_D}. \tag{2} \]

The Elmore time constant ($T_D$) is determined by matching the first moment of 
the exact impulse response: 

\[ \int_0^\infty t\, h_{ELM}(t)\, dt = \int_0^\infty t\, h(t)\, dt. \tag{3} \]
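
As a concrete instance of (2) and (3), the Elmore time constant of an unbranched RC ladder can be computed directly. The sketch below uses the well-known closed form for a ladder (each series resistance multiplied by the capacitance downstream of it), which is not spelled out in the text; the function names and element values are illustrative assumptions.

```python
def elmore_delays(resistances, capacitances):
    """Elmore time constant at each node of an RC ladder driven at its input."""
    delays, total = [], 0.0
    for k, r in enumerate(resistances):
        downstream_cap = sum(capacitances[k:])   # capacitance charged through r
        total += r * downstream_cap
        delays.append(total)
    return delays

def two_section_transfer(s, r1, r2, c1, c2):
    """Exact voltage-transfer ratio to the far node of a two-section ladder."""
    return 1.0 / (1.0 + s * (r1 * c1 + r1 * c2 + r2 * c2) + s * s * r1 * r2 * c1 * c2)

delays = elmore_delays([1.0, 2.0], [3.0, 4.0])
# First moment of the exact impulse response equals -dH/ds at s = 0.
h = 1e-6
first_moment = (two_section_transfer(-h, 1.0, 2.0, 3.0, 4.0)
                - two_section_transfer(h, 1.0, 2.0, 3.0, 4.0)) / (2 * h)
```

The far-node value in `delays` should agree with `first_moment`, which is the matching condition (3) evaluated numerically.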



The Horowitz approximate transfer function has two poles and one zero: 

\[ H_{HOR}(s) = \frac{1 + s\tau_z}{(1 + s\tau_1)(1 + s\tau_2)}. \tag{4} \]

The three parameters in Horowitz’s approximation ($\tau_z$, $\tau_1$, and $\tau_2$) are determined 
by matching the first two moments of the exact impulse response and the 
“sum of the open-circuit time constants” (i.e., the coefficient of $s$ in the denominator 
of the exact transfer function): 

\[ \int_0^\infty t\, h_{HOR}(t)\, dt = \int_0^\infty t\, h(t)\, dt \tag{5} \]

\[ \int_0^\infty t^2\, h_{HOR}(t)\, dt = \int_0^\infty t^2\, h(t)\, dt \tag{6} \]

\[ \frac{1 + b_1 s + b_2 s^2 + \cdots}{1 + (\tau_1 + \tau_2)s + a_2 s^2 + \cdots} = H(s). \tag{7} \]
Figure 2. Circuit approximations for the driving-point admittance of a general RC-tree model 
for an interconnect net. 



3. Driving-Point Admittance Approximations 

In this section, we present three successively higher-order circuit approximations 
for the driving-point admittance of a general RC-tree (see Fig. 2). Let 
$Y(s)$ denote the exact driving-point admittance of the RC-tree. Within some circle 
of convergence, we can represent $Y(s)$ by its Taylor-series expansion around 
$s = 0$: 

\[ Y(s) = \sum_{n=1}^{\infty} y_n s^n. \tag{8} \]

For $1 \le i \le 3$, the circuit approximation $Y_{APPi}(s)$ shown in Fig. 2 matches 
expansion (8) to order $i$. 

Let $y(t)$ denote the inverse Laplace transform of $Y(s)$ (i.e., $y(t)$ is the response 
current into a general RC-tree caused by an applied unit voltage impulse). 
Matching higher-order terms in (8) is mathematically equivalent to 
matching higher-order time moments of $y(t)$, since 

\[ y_n = \frac{(-1)^n}{n!} \int_0^\infty t^n y(t)\, dt, \tag{9} \]




so the approximations described in this section (for a driving-point admittance) 
are quite analogous to those described in the previous section (for a voltage- 
transfer ratio). 

The approximation $Y_{APP1}(s) = Cs$, where $C = y_1 = C_{LOAD}$ and $C_{LOAD}$ is simply 
the total load net capacitance to ground, is widely used. In fact, data-book 
descriptions of gates from semiconductor vendors use gate delay as a function 
of load net capacitance to characterize the drive capability of their cells. How- 
ever, we show in Section 5 that significant errors can result from ignoring metal 
resistance. 

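The RC-lump approximation, 

\[ Y_{APP2}(s) = \frac{sC}{1 + sRC} = \sum_{n=1}^{\infty} (-R)^{n-1} C^n s^n, \tag{10} \]

matches (8) to second order by setting 

\[ C = y_1 \ (= C_{LOAD}) \tag{11} \]

\[ R = -y_2 / y_1^2. \tag{12} \]

The CRC pi-segment approximation, 

\[ Y_{APP3}(s) = sC_2 + \frac{sC_1}{1 + sRC_1} = s(C_1 + C_2) + \sum_{n=2}^{\infty} (-R)^{n-1} C_1^n s^n, \tag{13} \]

provides even more accuracy by matching (8) to third order: 

\[ C_1 = y_2^2 / y_3 \tag{14} \]

\[ C_2 = y_1 - (y_2^2 / y_3) \tag{15} \]

\[ R = -y_3^2 / y_2^3. \tag{16} \]

Note that $C_1 + C_2 = y_1 = C_{LOAD}$ in the pi-segment approximation. 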

We have found that either second- or third-order matching provides sufficient 
accuracy in all cases of practical interest to us, though, in principle, arbitrarily 
high orders of the driving-point admittance could be matched with appropriate 
lumped circuit models. In this section, it is assumed the necessary series expansion 
coefficients of $Y(s)$ (i.e., $y_1$, $y_2$, and $y_3$) are known. In Section 4, we present 
an efficient algorithm for computing these coefficients. 
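
The moment-matching step can be written in a few lines. The sketch below assumes the coefficients $y_1$, $y_2$, $y_3$ are already known and returns the element values of the RC-lump and pi-segment circuits; the function names are hypothetical, and the sample coefficients are those of an open-ended uniform RC line with total resistance 3 and total capacitance 2 (arbitrary units), chosen only for illustration.

```python
def rc_lump(y1, y2):
    """Second-order match: series R followed by C to ground. Returns (C, R)."""
    return y1, -y2 / y1**2

def pi_segment(y1, y2, y3):
    """Third-order match: C2 at the driver, then R, then C1. Returns (C1, C2, R)."""
    c1 = y2**2 / y3
    return c1, y1 - c1, -y3**2 / y2**3

# Coefficients of an open-ended uniform line with R = 3, C = 2:
# y1 = C, y2 = -R C^2 / 3, y3 = 2 R^2 C^3 / 15.
y1, y2, y3 = 2.0, -4.0, 9.6
lump = rc_lump(y1, y2)
pi = pi_segment(y1, y2, y3)
```

For this line the lump resistance comes out as one third of the line resistance, and the pi-segment splits the capacitance 5/6 to 1/6 with 12/25 of the resistance between the two capacitors.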

4. Algorithm For Series Coefficients 

In this section, we present our algorithm for computing (the first three) Taylor-series 
expansion coefficients of $Y(s)$ which are needed to perform the matching 
to a lumped circuit approximation described in Section 3. The algorithm starts 
at the leaf nodes of an RC-tree and works back to the source in a finite sequence 
of steps. 

The algorithm consists of four rules which allow the Taylor-series expansion 
coefficients of the driving-point admittance looking downstream of a given point 
in the tree to be correctly propagated further upstream (see Fig. 3). Rules #1-3 
involve movement upstream, along a single branch, past respectively: a lumped 
capacitor to ground, a series lumped resistor, and a uniform distributed RC seg- 
ment. Rule #4 involves combining two or more different admittance expansions 
in parallel at a branch point in the tree. 




[Figure 3 panel labels: Rule #2, Rule #3, Rule #4; $Y_U(s)$, $Y_P(s)$; R, C] 



Figure 3. Four rules for upstream propagation of driving-point admittance expansion coeffi- 
cients. 



For rules #1-3, denote the known admittance expansion (looking downstream 
from the point which is immediately downstream of the circuit element to be 
traversed) by 

\[ Y_D(s) = \sum_{n=1}^{3} (y_D)_n s^n + O(s^4). \tag{17} \]

Denote the unknown admittance expansion (looking downstream from the point 
which is immediately upstream of the circuit element to be traversed) by 

\[ Y_U(s) = \sum_{n=1}^{3} (y_U)_n s^n + O(s^4). \tag{18} \]

Rule #1 states that for upstream traversal of a lumped capacitor ($C$) to ground, 

\[ (y_U)_1 = (y_D)_1 + C \tag{19} \]

\[ (y_U)_2 = (y_D)_2 \tag{20} \]

\[ (y_U)_3 = (y_D)_3. \tag{21} \]

Rule #2 states that for upstream traversal of a series lumped resistor ($R$), 

\[ (y_U)_1 = (y_D)_1 \tag{22} \]

\[ (y_U)_2 = (y_D)_2 - R\,(y_D)_1^2 \tag{23} \]

\[ (y_U)_3 = (y_D)_3 - 2R\,(y_D)_1 (y_D)_2 + R^2 (y_D)_1^3. \tag{24} \]

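Rule #3 states that for upstream traversal of a uniform distributed RC segment 
(total capacitance $C$ and total resistance $R$), 

\[ (y_U)_1 = (y_D)_1 + C \tag{25} \]

\[ (y_U)_2 = (y_D)_2 - R \left[ (y_D)_1^2 + C\,(y_D)_1 + \tfrac{1}{3} C^2 \right] \tag{26} \]

\[ (y_U)_3 = (y_D)_3 - R \left[ 2 (y_D)_1 (y_D)_2 + C\,(y_D)_2 \right] + R^2 \left[ (y_D)_1^3 + \tfrac{4}{3} C\,(y_D)_1^2 + \tfrac{2}{3} C^2 (y_D)_1 + \tfrac{2}{15} C^3 \right]. \tag{27} \]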

For rule #4, let $B$ ($\ge 2$) denote the number of branches to be combined in 
parallel. Denote the $B$ known downstream admittance expansions by 

\[ Y_{Di}(s) = \sum_{n=1}^{3} (y_{Di})_n s^n + O(s^4), \qquad 1 \le i \le B. \tag{28} \]

Denote the admittance expansion of the parallel combination by 

\[ Y_P(s) = \sum_{n=1}^{3} (y_P)_n s^n + O(s^4). \tag{29} \]

Parallel admittances simply add together, and so do corresponding terms of their 
Taylor-series expansions. Hence, rule #4 states: 

\[ (y_P)_1 = \sum_{i=1}^{B} (y_{Di})_1 \tag{30} \]

\[ (y_P)_2 = \sum_{i=1}^{B} (y_{Di})_2 \tag{31} \]

\[ (y_P)_3 = \sum_{i=1}^{B} (y_{Di})_3. \tag{32} \]
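
The four rules translate directly into code. The sketch below represents an admittance expansion as a tuple $(y_1, y_2, y_3)$, with $(0, 0, 0)$ for an open leaf; the function names are hypothetical. As a consistency check it compares rule #3 against a fine ladder of lumped elements built from rules #1 and #2, which should approach the distributed result as the number of sections grows.

```python
def past_capacitor(y, c):
    """Rule #1: move upstream past a lumped capacitor c to ground."""
    y1, y2, y3 = y
    return (y1 + c, y2, y3)

def past_resistor(y, r):
    """Rule #2: move upstream past a series lumped resistor r."""
    y1, y2, y3 = y
    return (y1, y2 - r * y1**2, y3 - 2 * r * y1 * y2 + r**2 * y1**3)

def past_uniform_rc(y, r, c):
    """Rule #3: move upstream past a uniform distributed RC segment (totals r, c)."""
    y1, y2, y3 = y
    return (y1 + c,
            y2 - r * (y1**2 + c * y1 + c**2 / 3),
            y3 - r * (2 * y1 * y2 + c * y2)
            + r**2 * (y1**3 + 4 * c * y1**2 / 3 + 2 * c**2 * y1 / 3 + 2 * c**3 / 15))

def parallel(*branches):
    """Rule #4: combine downstream expansions at a branch point."""
    return tuple(sum(b[n] for b in branches) for n in range(3))

R, C, sections = 3.0, 2.0, 400
distributed = past_uniform_rc((0.0, 0.0, 0.0), R, C)
ladder = (0.0, 0.0, 0.0)
for _ in range(sections):                 # walk from the open leaf back to the source
    ladder = past_capacitor(ladder, C / sections)
    ladder = past_resistor(ladder, R / sections)
```

A tree is handled by applying these functions from each leaf toward the root and calling `parallel` wherever branches meet.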



5. Results 

To conclude, we show plots of driver delay versus metal loading. The metal 
loading we consider is an unbranched uniform distributed RC segment, which 
has a driving-point admittance of precisely 

\[ Y(s) = \sqrt{\frac{sC_{LOAD}}{R_{LOAD}}}\, \tanh\!\left( \sqrt{sR_{LOAD}C_{LOAD}} \right), \tag{33} \]

where $C_{LOAD}$ is the total capacitance and $R_{LOAD}$ is the total series resistance of 
the segment. 

The first-order (purely capacitive) approximation is $C = C_{LOAD}$, which ignores 
resistance in the metal load. The second-order (RC-lump) approximation 
works out to be 

\[ C = C_{LOAD} \tag{34} \]

\[ R = \tfrac{1}{3} R_{LOAD}. \tag{35} \]

The third-order (pi-segment) approximation works out to be 

\[ C_1 = \tfrac{5}{6} C_{LOAD} \tag{36} \]

\[ C_2 = \tfrac{1}{6} C_{LOAD} \tag{37} \]

\[ R = \tfrac{12}{25} R_{LOAD}. \tag{38} \]

We compare the driver delay using each of these approximate loads to driver 
delay using the fully distributed load. 

The driver we use is an ECL differential clock buffer. The two differen- 
tial outputs are each loaded identically (by (33), or by one of its reduced-order 
lumped circuit approximations). Gate delay is measured from the crossing time of 
the signals at the buffer input to the crossing time at the buffer output. Gate 
delay ($T_{AB}$) is normalized to unloaded gate delay ($T_{AB}(0)$). Total metal 
capacitance ($C_{LOAD}$) is normalized to the maximum allowable load capacitance 
($C_{MAX}$), given the sizing of our clock buffer. 

To illustrate the importance of metal resistance, we use two different metal 
widths: a high-resistance “narrow” metal and a low-resistance “wide” metal. 
Physically, wide metal is twice the width of narrow metal. For a given total 
metal capacitance, the total series resistance of a narrow metal segment is 3.2 
times that of a wide segment (not the ideal factor of 4, because of a fringing-field 
capacitance component). As can be seen in Figs. 4 and 5, progressively longer 
metal lengths require progressively higher-order lumped circuit approximations 
to accurately model the “resistive shielding” of capacitance which is located far 
away from the driver. 
Abstract 

In this paper we describe combining a mesh analysis equation formulation technique with a pre- 
conditioned GMRES matrix solution algorithm to accelerate the determination of inductances of 
complex three-dimensional structures. Results from FASTHENRY, our 3-D inductance extraction 
program, demonstrate that the method is more than an order of magnitude faster than the standard 
solution techniques for large problems. 



1. Introduction 

In high performance VLSI integrated circuits and integrated circuit packag- 
ing, there are many cases where accurate estimates of the coupling inductances 
of complicated three dimensional structures are important for determining final 
circuit speeds or functionality, the most obvious example being the pin-connect 
structures used in advanced packaging. For the past decade, volume-element 
techniques have been used to compute self and coupling inductances of complex 
three dimensional geometries, but the techniques were intended for geometries 
which could be represented with at most a few hundred volume filaments. How- 
ever, the complex structures currently used in integrated circuit packaging can 
require up to ten thousand filaments to be accurately analyzed. Existing programs 
become extraordinarily computationally expensive for such large prob- 
lems, and new algorithms whose computational cost grows more slowly with 
problem size must be developed. 

In this paper we describe how an old equation formulation technique, mesh 
analysis, can be combined with preconditioned GMRES, a relatively new it- 
erative matrix solution technique, to make FASTHENRY, a fast 3-D inductance 
extraction program for general packaging structures. We start in the next section 
by describing a standard approach to the frequency dependent inductance and 
resistance calculation. In Section 3 we describe the mesh formulation approach, 
and in Section 4 we briefly describe GMRES. Results from FASTHENRY are 
given in Section 5, followed by conclusions and acknowledgments. 

2. Inductance calculation 

One approach to computing the frequency dependent inductance and resistance 
matrix, denoted $Z_r$, associated with the terminal behavior of a collection 
of conductors involves first approximating each conductor as a set of piecewise 
straight conducting sections. The volume of each straight section is then 
discretized into a collection of parallel thin filaments through which current is 
assumed to flow uniformly. The interconnection of these current filaments can 
be represented with a planar graph, where the n nodes in the graph are associ- 
ated with connection points between conductor segments, and the b branches in 
the graph represent the current filaments into which each conductor segment is 
discretized. 

To derive a system of equations from which the resistance and inductance 
matrix can be deduced, we start by assuming the applied currents and voltages 
are sinusoidal, and that the system is in sinusoidal steady-state. Following the 
partial inductance approach in [1, 2], the branch current phasors can be related 
to branch voltage phasors (hereafter, phasors will be assumed and not restated) 
by

$$Z I_b = V_b, \tag{1}$$

where $V_b, I_b \in \mathbb{C}^b$, $b$ is the number of branches (number of current filaments),
and $Z \in \mathbb{C}^{b \times b}$ is the complex impedance matrix given by

$$Z = R + j\omega L, \tag{2}$$

where $\omega$ is the excitation frequency. The entries of the diagonal matrix $R \in \mathbb{R}^{b \times b}$
represent the dc resistance of each current filament, and $L \in \mathbb{R}^{b \times b}$ is the dense
matrix of partial inductances [3]. Specifically,

$$L_{ij} = \frac{\mu_0}{4\pi a_i a_j} \int_{\mathrm{filament}_i} \int_{\mathrm{filament}_j} \frac{l_i \cdot l_j}{|X_i - X_j|} \, d^3X_i \, d^3X_j, \tag{3}$$

where $X_i, X_j \in \mathbb{R}^3$ are the positions in filaments $i$ and $j$ respectively, $l_i, l_j$ are the
unit vectors in the direction of current flow in filaments $i$ and $j$, and $a_i, a_j$ are the
filament cross sectional areas.

The statement that the branch currents must satisfy Kirchhoff’s current law, 
that is, the currents entering each node must sum to zero, can be written using 
the branch incidence matrix as 



$$A I_b = I_s, \tag{4}$$

where $I_s \in \mathbb{C}^n$ is the mostly zero vector of source currents, $n$ is the number of
nodes (points where conductor sections meet or a conductor terminates) excluding
any reference or ground nodes, and $A \in \mathbb{R}^{n \times b}$ is the branch incidence matrix.
The node voltages can be related to the branch voltages by

$$A^t V_n = V_b, \tag{5}$$

where $A^t$ is the transpose of the branch incidence matrix, and $V_n \in \mathbb{C}^n$ is the
vector of $n$ referenced node voltages. Combining (5) with (4) and (1) yields

$$A Z^{-1} A^t V_n = I_s. \tag{6}$$

The complex impedance matrix which describes the terminal behavior of the
conductor system, $Z_r$, can be derived from (6) by noting that

$$Z_r I_s = V_s, \tag{7}$$

where $I_s$ and $V_s$ are the vectors of source currents and voltages. Therefore, the
$i$-th column of $Z_r$ can be computed by solving (6) with an $I_s$ whose only nonzero
entry corresponds to $I_{s_i}$, and then extracting the elements of $V_n$ corresponding to
$V_s$.

In most programs, the dense matrix problem in (6) is solved with some form
of Gaussian elimination, and this implies that the calculation grows as $b^3$, where
again $b$ is the number of current filaments into which the system of conductors
is discretized [5]. For complicated packaging structures, $b$ can exceed ten
thousand, and solving (6) with Gaussian elimination can take days, even using a high
performance scientific workstation.
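As a rough illustration of the nodal formulation in (4) to (6), the sketch below solves a small made-up network in plain Python: two conductor sections in series, each split into two parallel filaments. The incidence matrix `A`, the element values, and the helper names (`solve`, `branch_impedance`, `terminal_impedance_nodal`) are assumptions chosen for the example and are not taken from FASTHENRY.

```python
import math

def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for r in reversed(range(n)):
        acc = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - acc) / aug[r][r]
    return x

def branch_impedance(R, L, freq_hz):
    """Z = R + j*omega*L, with R given as a list of filament resistances."""
    omega = 2.0 * math.pi * freq_hz
    b = len(R)
    return [[(R[i] if i == j else 0.0) + 1j * omega * L[i][j] for j in range(b)]
            for i in range(b)]

# Hypothetical incidence matrix: rows are nodes 1 and 2 (ground excluded),
# columns are the four filaments; +1 means the filament leaves the node.
A = [[1, 1, 0, 0],
     [-1, -1, 1, 1]]

def terminal_impedance_nodal(Z):
    """Solve A Z^-1 A^t V_n = I_s for a unit current injected at node 1."""
    b, n = len(Z), len(A)
    # Columns of Z^-1 A^t: each needs a dense solve with the b x b matrix Z.
    zinv_at = [solve(Z, [A[k][i] for i in range(b)]) for k in range(n)]
    Yn = [[sum(A[r][i] * zinv_at[k][i] for i in range(b)) for k in range(n)]
          for r in range(n)]
    Vn = solve(Yn, [1.0, 0.0])
    return Vn[0]  # terminal voltage per unit terminal current

R = [1.0, 1.0, 1.0, 1.0]            # ohms, made-up values
L = [[2e-9, 1e-9, 0.0, 0.0],        # henries, coupling only within a section
     [1e-9, 2e-9, 0.0, 0.0],
     [0.0, 0.0, 2e-9, 1e-9],
     [0.0, 0.0, 1e-9, 2e-9]]
Z_r = terminal_impedance_nodal(branch_impedance(R, L, 1e8))
print(Z_r)
```

The dense solves with `Z` are what dominate the cost as the number of filaments grows.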

3. Mesh current approach 

The approach to calculating the frequency dependent inductance and resistance
matrix described above has some disadvantages if (6) is to be solved with
an iterative method. It is difficult to apply the iterative method, because the
matrix contains $Z^{-1}$, which can only be computed by forming the dense
matrix $Z$, and then somehow inverting it. Another approach to generating a
system of equations for the currents and voltages in the network representing the
conductor system discretization is mesh analysis [4], and the mesh approach has
some advantages which will be made clear below.

To begin, mesh analysis is easiest to describe if it is assumed that the sources
attached to the conductor system's terminals generate explicit branches in the
graph representing the discretized problem. Kirchhoff's voltage law, which
implies that the sum of branch voltages around each mesh in the network is zero
(a mesh is any loop of branches in the graph which does not enclose any other
branches), is represented by

$$M V_b = V_s, \tag{8}$$

where $V_b$ is the vector of voltages across each branch except for the source
branches, $V_s$ is the mostly zero vector of source branch voltages, and
$M \in \mathbb{R}^{m \times b}$ is the mesh matrix, where $m = b - s + c$, $s$ is the number of
conductor sections and $c$ is the number of conductors. The $M$ matrix has the
property that all of its nonzero entries are +1 or -1 and that most of its rows
($m - c$ of them) have no more than two nonzero entries.

The relationship between branch currents and branch voltages given in (1)
still holds, and the mesh currents, that is, the currents around each mesh loop,
satisfy

$$M^t I_m = I_b, \tag{9}$$

where $I_m \in \mathbb{C}^m$ is the vector of mesh currents. Note that one of the entries in
the mesh current vector will be identically equal to the source branch current.
Combining (9) with (8) and (1) yields

$$M Z M^t I_m = V_s. \tag{10}$$

The matrix $MZM^t$ is easily constructed directly. To compute the $i$-th column
of the reduced admittance matrix, $Y_r = Z_r^{-1}$, solve (10) with a $V_s$ whose only
nonzero entry corresponds to $V_{s_i}$, and then extract the entries of $I_m$ associated
with the source branches.
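The mesh formulation can be sketched in the same spirit. The example below is only an illustration under stated assumptions: a made-up network of two sections in series with two filaments each (so $b = 4$, $s = 2$, $c = 1$ and $m = 3$), a hand-written mesh matrix `M`, and hypothetical helper names. It forms $MZM^t$ directly, without inverting $Z$, and reads the terminal admittance from the mesh current that flows through the source branch.

```python
import math

def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for r in reversed(range(n)):
        acc = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - acc) / aug[r][r]
    return x

# Hypothetical mesh matrix (m = 3 meshes, b = 4 filaments).
M = [[1, -1, 0, 0],   # loop between the two filaments of section 1
     [0, 0, 1, -1],   # loop between the two filaments of section 2
     [1, 0, 1, 0]]    # loop closed through the source branch

def mesh_matrix_product(Z):
    """Form M Z M^t explicitly; no inverse of Z is required."""
    m, b = len(M), len(Z)
    ZMt = [[sum(Z[i][j] * M[k][j] for j in range(b)) for k in range(m)]
           for i in range(b)]
    return [[sum(M[r][i] * ZMt[i][k] for i in range(b)) for k in range(m)]
            for r in range(m)]

def terminal_admittance_mesh(Z):
    """Solve M Z M^t I_m = V_s with a unit source voltage in the third mesh."""
    Im = solve(mesh_matrix_product(Z), [0.0, 0.0, 1.0])
    return Im[2]  # the third mesh current equals the source branch current

omega = 2.0 * math.pi * 1e8
z_self = 1.0 + 1j * omega * 2e-9   # made-up filament self impedance
z_mut = 1j * omega * 1e-9          # made-up mutual term within a section
Z = [[z_self, z_mut, 0, 0],
     [z_mut, z_self, 0, 0],
     [0, 0, z_self, z_mut],
     [0, 0, z_mut, z_self]]
Y_r = terminal_admittance_mesh(Z)
print(1.0 / Y_r)
```

For this toy case the system solved is 3x3 even though there are four filaments, which mirrors the reduction from $b$ to $m$ unknowns described above.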

4. Using an iterative solver 

The standard approach to solving the complex linear system in (10) is Gaussian
elimination, but the cost is $m^3$ operations. For this reason, inductance
extraction of packages requiring more than a few thousand filaments is
considered computationally intractable. To improve the situation, consider using a
conjugate-residual style iterative method like GMRES [7]. Such methods have
the general form given in Algorithm 1.

Algorithm 1 (GMRES Algorithm for Ax = b)
guess x^0
for k = 0, 1, ... until converged {
    Compute the error, r^k = b - A x^k
    Find x^(k+1) to minimize ||r^(k+1)||
        based on x^i and r^i, i = 0,...,k
}

Note that the GMRES algorithm can be directly applied to solving (10),
because the matrix $MZM^t$ is easily constructed explicitly. This is not the case for
(6). Just to form (6), the $Z$ matrix must first be inverted. This suggests that
either some kind of nested GMRES algorithm would be required to solve (6)
iteratively, or the matrix would have to be expanded into the sparse tableau, and
the GMRES algorithm applied to solving that expanded matrix.
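To make the residual-minimizing iteration more concrete, here is a minimal, unrestarted GMRES sketch in plain Python. It is an illustration rather than the FASTHENRY implementation: the function names and the 4x4 test matrix are invented, the initial guess is zero, and the small least-squares problem is solved through the normal equations only to keep the code short (Givens rotations are the usual, more stable choice).

```python
import math

def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for r in reversed(range(n)):
        acc = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - acc) / aug[r][r]
    return x

def dot(u, v):
    """Complex inner product: sum of conj(u_i) * v_i."""
    return sum(complex(ui).conjugate() * vi for ui, vi in zip(u, v))

def norm(u):
    return math.sqrt(sum(abs(ui) ** 2 for ui in u))

def gmres(matvec, b, tol=1e-10, max_iter=50):
    """Return (x, iterations); x minimizes ||b - A x|| over the Krylov space."""
    beta = norm(b)
    if beta == 0.0:
        return [0j] * len(b), 0
    basis = [[bi / beta for bi in b]]   # orthonormal Krylov vectors
    hess = []                           # hess[j] is column j, length j + 2
    for k in range(max_iter):
        w = matvec(basis[k])
        col = []
        for v in basis:                 # modified Gram-Schmidt
            h = dot(v, w)
            col.append(h)
            w = [wi - h * vi for wi, vi in zip(w, v)]
        h_next = norm(w)
        col.append(h_next)
        hess.append(col)
        rows, cols = k + 2, k + 1
        Hm = [[hess[j][i] if i < len(hess[j]) else 0.0 for j in range(cols)]
              for i in range(rows)]
        # Least squares min ||beta*e1 - Hm y|| via the normal equations.
        G = [[sum(complex(Hm[i][p]).conjugate() * Hm[i][q] for i in range(rows))
              for q in range(cols)] for p in range(cols)]
        g = [complex(Hm[0][p]).conjugate() * beta for p in range(cols)]
        y = solve(G, g)
        res = [(beta if i == 0 else 0.0) - sum(Hm[i][j] * y[j] for j in range(cols))
               for i in range(rows)]
        if norm(res) <= tol * beta or h_next < 1e-14 or k == max_iter - 1:
            x = [sum(y[j] * basis[j][i] for j in range(cols)) for i in range(len(b))]
            return x, k + 1
        basis.append([wi / h_next for wi in w])

# Invented, diagonally dominant complex test system.
A_demo = [[4 + 1j, 1, 0, 0.5j],
          [1, 5, 1j, 0],
          [0, 1j, 6 - 1j, 1],
          [0.5j, 0, 1, 4]]
b_demo = [1, 2j, 0, -1]
x_demo, iterations = gmres(
    lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A_demo], b_demo)
print(iterations, x_demo)
```

The solver only touches the matrix through `matvec`, which is why an explicitly available matrix such as $MZM^t$ suits it well.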

5. Accelerating iteration convergence 

In general, the GMRES iterative method applied to solving (10) can be
significantly accelerated by preconditioning if there is an easily computed good
approximation to the inverse of $MZM^t$. We denote the approximation to $(MZM^t)^{-1}$
by $P$, in which case preconditioning the GMRES algorithm is equivalent to using
GMRES to solve

$$P (MZM^t) I_m = P V_s \tag{11}$$

for the unknown vector $I_m$. Clearly, if $P$ is precisely $(MZM^t)^{-1}$, then (11) is
trivial to solve, but then $P$ will be very expensive to compute.

A good approximation to $(MZM^t)^{-1}$ that is easily computed can be derived
by exploiting the fact that a mesh current in a given conductor is tightly coupled
to the other mesh currents within that conductor. With an appropriate numbering
of the mesh currents, the interactions between meshes of the same conductor can
be clustered into blocks along the diagonal of $MZM^t$. The inverse of the block
diagonal matrix so generated is then an approximation to $(MZM^t)^{-1}$.

Preconditioning with this simple preconditioner proved extremely effective. 
Table 1 shows the average number of iterations with tol — 10“^ for a single 
solve of the pin package example described in the next section. The number 
of iterations required by GMRES without the preconditioner increased rapidly 
with problem size, but with the preconditioner, the iterations remained constant. 
This result easily makes up for the small cost of calculating the preconditioner. 
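A block-diagonal preconditioner of this kind might be sketched as follows. The 4x4 matrix, the grouping of meshes into two conductors, and the function names are assumptions made for the illustration. The code applies $P$ by solving with each diagonal block and then measures how close $PA$ is to the identity for a matrix whose in-block coupling is much stronger than its cross-block coupling.

```python
import math

def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for r in reversed(range(n)):
        acc = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - acc) / aug[r][r]
    return x

def block_diagonal_preconditioner(A, blocks):
    """Return v -> P v, where P is the inverse of the block diagonal of A.

    `blocks` lists, per conductor, the indices of its mesh currents."""
    sub = [[[A[i][j] for j in idx] for i in idx] for idx in blocks]

    def apply(v):
        out = [0j] * len(v)
        for idx, blk in zip(blocks, sub):
            y = solve(blk, [v[i] for i in idx])
            for i, yi in zip(idx, y):
                out[i] = yi
        return out

    return apply

# Invented matrix: meshes {0, 1} and {2, 3} belong to two different conductors.
A_demo = [[4 + 2j, 1 + 1j, 0.1j, 0.05j],
          [1 + 1j, 4 + 2j, 0.05j, 0.1j],
          [0.1j, 0.05j, 4 + 2j, 1 + 1j],
          [0.05j, 0.1j, 1 + 1j, 4 + 2j]]
apply_P = block_diagonal_preconditioner(A_demo, [[0, 1], [2, 3]])

n = len(A_demo)
PA_columns = [apply_P([A_demo[i][j] for i in range(n)]) for j in range(n)]
# Frobenius norm of I - P A: small when P is a good approximate inverse.
deviation = math.sqrt(sum(abs((1.0 if i == j else 0.0) - PA_columns[j][i]) ** 2
                          for i in range(n) for j in range(n)))
print(deviation)
```

In a preconditioned solve, `apply_P` would be composed with the matrix-vector product and applied to the right-hand side, as in (11).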

6. Results 

In this section we present our results from FASTHENRY. To test the mesh
formulation approach, we began with two parallel rectangular wires and
compared the results to those in [8]. As a more interesting example, FASTHENRY
was used to analyze 35 pins of a 68-pin package from Digital Equipment
Corporation. To demonstrate the effectiveness of preconditioned GMRES, times are
compared against direct inversion for finer and finer spatial discretization.

Size of MZM^t (m)    Iterations WITH precond.    Iterations WITHOUT precond.
       210                      8                            20
       560                      8                            38
       910                      8                            60
      1435                      8                           123
      1960                      8                           164

Table 1. Comparison of the average number of iterations per conductor with and without the
preconditioner.

Filaments per        Size of        Solution time,      Solution time,
conductor section    MZM^t (m)      direct inversion    preconditioned GMRES
        1                35             0.0003               0.007
        2               210             0.339                0.147
        4               560             8.02                 1.08
        6               910            35.9                  3.08
        9              1435           135                    7.85
       12              1960           344                   14.4

Table 2. Execution time comparison for the pin package example. Execution times are in
IBM RS6000/540 CPU minutes.

For the rectangular wire problem, we considered two parallel copper wires 
with a 2 mm by 2 mm cross section and 4 mm separation between their centers. 
The problem is treated as a one conductor problem, with one wire acting as 
the return path. The data in [8] is for a two dimensional analysis with wires of 
infinite length, so for this problem the lengths were chosen to be 100 meters and 
the results scaled appropriately. 

To observe how the results varied due to skin and proximity effects, the con- 
ductor was divided into various numbers of parallel thin filaments. To follow 
the decaying nature of the skin effect, the dimensions of the filaments were 
chosen to decrease geometrically toward the outer edge of the conductor. FAS- 
THENRY was run with 1, 4, 25, and 100 filaments per wire with the results 
compared in Figures 1 and 2 against those from [8]. For each case, as the skin 
depth became much smaller than the smallest filament, the calculated resistance 
and inductance stopped changing with frequency, as expected. For relatively 
few filaments per section (e.g. 25) one can determine much of the nature of the 
impedance. Both the resistance and inductance are accurately determined up 
to around 10^ Hz. The resistance results accurately determine the knee of the 
resistance curve, which shows the beginning of the skin effect. Since it is well
known that the resistance increases as the square root of the frequency after the
knee, higher frequency resistance values could easily be extracted.

Figure 1. Resistance of Two Long Rectangular Wires.
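The square-root behaviour can be turned into a small calculation. The sketch below is an illustration only: it uses the standard skin-depth expression $\delta = \sqrt{\rho / (\pi f \mu_0)}$ with an assumed room-temperature copper resistivity, and a hypothetical helper that scales a resistance value computed above the knee to a higher frequency.

```python
import math

MU_0 = 4.0e-7 * math.pi   # permeability of free space, H/m
RHO_COPPER = 1.68e-8      # ohm*m, assumed room-temperature value

def skin_depth(freq_hz, resistivity=RHO_COPPER):
    """Skin depth in meters for a good conductor at the given frequency."""
    return math.sqrt(resistivity / (math.pi * freq_hz * MU_0))

def extrapolate_resistance(r_ref, f_ref, f_target):
    """Scale a resistance known above the knee by sqrt(f_target / f_ref)."""
    return r_ref * math.sqrt(f_target / f_ref)

for f in (1e4, 1e6, 1e8):
    print(f, skin_depth(f))
```

Comparing `skin_depth(f)` with the smallest filament dimension gives a quick indication of the frequency beyond which a given discretization can no longer follow the current crowding.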

Thirty-five pins of a 68-pin package from DEC, shown in Figure 3, proved 
a good test of the utility of FASTHENRY. Each pin consists of five conductor 
sections. Neglecting skin effects and choosing one filament (or branch) per 
section gives a system of 175 filaments and 35 meshes. Notice that for mesh 
analysis, the solution is obtained by solving only a 35x35 system, while for the 
branch incidence matrix approach as in (6) inversion of a 175x175 system is 
necessary. 

To accurately model skin and proximity effects, each conductor section is
divided into multiple filaments. As the discretization is refined, the size of the
problem grows quickly. For these problems, the advantage of the GMRES
algorithm becomes apparent (see Table 2). For twelve filaments per section, GMRES
is already 23 times faster than direct inversion. In fact, the number of operations
required by GMRES grows only as $m^2$, while the direct method grows as $m^3$.
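The growth rates can be checked against the timings reported in Table 2. The short script below copies the table values and estimates the exponent from a log-log slope; restricting the fit to the rows from m = 560 upward is a choice made here (the smallest cases are likely dominated by overhead), and the function names are illustrative.

```python
import math

# (m, direct-inversion minutes, preconditioned-GMRES minutes) from Table 2
TABLE_2 = [(35, 0.0003, 0.007), (210, 0.339, 0.147), (560, 8.02, 1.08),
           (910, 35.9, 3.08), (1435, 135.0, 7.85), (1960, 344.0, 14.4)]

def speedups(rows):
    """Ratio of direct-inversion time to GMRES time for each problem size."""
    return [(m, direct / gmres) for m, direct, gmres in rows]

def growth_exponent(rows, column):
    """Slope of log(time) versus log(m) between the first and last rows.

    column 1 selects direct inversion, column 2 preconditioned GMRES."""
    first, last = rows[0], rows[-1]
    return math.log(last[column] / first[column]) / math.log(last[0] / first[0])

direct_exp = growth_exponent(TABLE_2[2:], 1)
gmres_exp = growth_exponent(TABLE_2[2:], 2)
print(speedups(TABLE_2)[-1], direct_exp, gmres_exp)
```

The fitted exponents come out close to 3 for direct inversion and close to 2 for preconditioned GMRES, in line with the stated operation counts.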

Notice that twelve filaments per section is barely enough to observe skin
effects; however, memory requirements limit any finer discretization. It is worth
noting that future work to implement multipole algorithms [9] will further
reduce both the computational costs and memory requirements, thus allowing finer
discretizations.



Figure 2. Inductance of Two Long Rectangular Wires. (Vertical axis: L in nH/m.)

Figure 3. Half of a pin-connect structure. Thirty-five pins shown.
In this paper, we do not give a comparison to the branch incidence approach in
(6) since it would require inverting $Z$, the branch impedance matrix. $Z$ is always
larger than $MZM^t$, so inverting it would be more expensive than the direct
inversion of $MZM^t$ shown in Table 2.

7. Conclusions and acknowledgments 

It is shown in this paper that if mesh rather than nodal analysis is used to
form the system of equations which must be solved to determine inductance,
then the equations can be solved easily with the iterative GMRES algorithm.
Addition of the preconditioner to GMRES can reduce the cost of solution to
order $m^2$ operations, compared to order $m^3$ for direct inversion. Results from
FASTHENRY, our 3-D inductance extraction program, demonstrate that the
iterative approach can accelerate solution times by more than an order of magnitude.

Future work using multipole algorithms will exploit the fact that the
off-diagonal elements of $Z$ are the partial inductances generated from integrals of
$1/r$. Such methods will avoid forming and storing most of the entries in the dense
matrix $MZM^t$, and reduce the cost of calculating matrix-vector products required
for the GMRES procedure to order $b$ operations.

Currently, FASTHENRY is being extended to include ground planes. Results
will be presented at the conference.

and Joel Phillips for their help in understanding inductance. In addition, the

Abstract

A new, time-domain, non-Monte Carlo method for computer simulation of electrical noise in 
nonlinear dynamic circuits with arbitrary excitations is presented. This time-domain noise sim- 
ulation method is based on the results from the theory of stochastic differential equations. The 
noise simulation method is general in the sense that any nonlinear dynamic circuit with any kind 
of excitation, which can be simulated by the transient analysis routine in a circuit simulator, can be 
simulated by our noise simulator in time-domain to produce the noise variances and covariances 
of circuit variables as a function of time, provided that noise models for the devices in the circuit 
are available. Noise correlations between circuit variables at different time points can also be 
calculated. Previous work on computer simulation of noise in integrated circuits is reviewed with 
comparisons to our method. Shot, thermal and flicker noise models for integrated-circuit devices, 
in the context of our time-domain noise simulation method, are described. The implementation of 
this noise simulation method in a circuit simulator (SPICE) is described. Two examples of noise 
simulation (a CMOS ring-oscillator and a BJT active mixer) are given.



1. Introduction 

This paper presents a new, time-domain, non-Monte Carlo method for computer 
simulation of electrical noise in nonlinear dynamic circuits with arbitrary exci- 
tations. This time-domain noise simulation method is based on the results from 
the theory of stochastic differential equations. The noise phenomena consid- 
ered in this work are caused by the small current and voltage fluctuations that 
are generated within the integrated-circuit devices themselves. The existence 
of noise is basically due to the fact that electrical charge is not continuous but 
is carried in discrete amounts equal to the electron charge. Electrical noise is 
associated with fundamental processes in integrated-circuit devices [1]. Noise 
represents a lower limit to the size of electrical signal that can be amplified by
a circuit without significant deterioration in signal quality. It also results in an
upper limit to the useful gain of an amplifier, because if the gain is increased
without limit, the output stages of the circuit will eventually begin to cut off
or saturate on the amplified noise from the input stages [1]. The influence of
noise on the performance is not limited to amplifier circuits. For instance, active 
integrated mixer circuits, which are widely used for down conversion in UHF 
and microwave receivers, add noise to their output. It is desirable to be able to 
predict the noise performance of a given mixer design [2, 3]. Most of the time, 
amplifier circuits operate in small-signal conditions, that is, the operating point 
of the circuit does not change. For analysis and simulation, the amplifier circuit 
with a fixed operating-point can be modeled as a linear time-invariant network 
by making use of the small-signal models of the integrated-circuit devices. On 
the other hand, for a mixer circuit, the presence of a large local-oscillator signal
causes substantial change in the active devices' operating points over time. So,
a linear time-invariant network model is not accurate for a mixer circuit. There
are many other kinds of circuits which do not operate in small-signal conditions,
such as a voltage-controlled oscillator (VCO) composed of delay cells in a ring
configuration. Noise simulation of these circuits requires a method which can
handle nonlinear dynamic circuits with arbitrary excitations. The three
important types of noise in integrated circuits are shot noise, thermal noise and flicker
noise, which will all be considered in this work.

In Section 2 below, previous work on computer simulation of noise in inte- 
grated circuits is reviewed with comparisons to our method. In Section 3, shot, 
thermal and flicker noise models for integrated-circuit devices, in the context of 
our time-domain noise simulation method, are described. Section 4 describes 
our noise simulation method. In Section 5, the implementation of the noise sim- 
ulation method, in the context of a nodal-analysis circuit simulation program 
(SPICE), is described. Two examples of noise simulation are presented in Sec- 
tion 6. Finally, future work is stated in Section 7. 

2. Previous work 

The electrical noise sources in passive elements and integrated-circuit devices 
have been investigated extensively. Small-signal equivalent circuits, including 
noise, for many integrated-circuit components have been constructed [1]. The 
noise performance of a circuit can be analyzed in terms of these small-signal 
equivalent circuits by performing sinusoidal circuit analysis in frequency do- 
main in the usual fashion. This analysis is done separately for each of the uncor- 
related noise sources, and for a range of frequencies. For a complicated circuit, 
the large number of noise sources and circuit complexity completely preclude 
hand calculation. In fact, even machine computation of the noise contributions 
from all noise sources can be time consuming. Fortunately, an extremely
efficient computational technique, based on the inter-reciprocal adjoint network
concept, was proposed [4, 5]. This technique calculates the noise contribution
from an arbitrarily large number of noise sources at a given frequency with little 
more computer time than is normally required for a single noise source. The 
noise analysis in SPICE is based on this method. Unfortunately, this method is 
only applicable to linear time-invariant circuits (e.g. the small-signal equivalent 
circuits corresponding to circuits with fixed operating points). It is not appro- 
priate for noise simulation of circuits with changing bias conditions, or circuits 
which are not meant to operate in small-signal conditions. 

[2, 3] and [6] present noise analysis techniques for nonlinear circuits with a
periodic large signal excitation. The noise analysis for a nonlinear circuit with
a periodic large signal excitation reduces to the analysis of a linear
periodically time-varying circuit with cyclostationary noise sources [2, 3, 6]. This is
arrived at by a first-order Taylor expansion of the circuit equations around the
periodic steady-state solution of the circuit without the noise sources and the
small-signal excitations. This Taylor approximation is similar to the one we
will present in Section 4.1. The noise analysis methods described in [2, 3] and
[6] use frequency-domain methods based on manipulating impulse responses
and transfer functions for a linear periodically time-varying system, and
spectral densities for cyclostationary noise sources. These noise analysis techniques
are applicable to only a limited class of nonlinear circuits with two excitations,
where one of the excitations is large and periodic and the other is small (e.g.,
mixer circuits, switched capacitor circuits).

The previous work on noise simulation in time-domain is restricted to techniques
which employ the Monte Carlo method [7]. This method has several drawbacks.
Pseudo-random number generators often do not generate a large sequence of
independent numbers, but reuse old random numbers instead. This becomes a
problem if a circuit with many noise sources is simulated, which is usually the
case, because every device has several noise sources associated with its model.
In this method, the same circuit is simulated many times by obtaining "different"
sample paths for each noise source. Then a statistical analysis is carried out to
calculate averages and variances over these many simulations. The noise content
in a waveform will be much smaller than the magnitude of the waveform itself. As
a result, the waveforms obtained for different sample paths of noise generators
will be very close to each other. In a simulator, these waveforms are only
numerical approximations to the actual waveforms, and therefore they contain
numerical noise. The RMS value of noise is calculated by taking a difference of
these waveforms. That is, two large numbers, which have uncertainty in them, are
being subtracted from each other. Consequently, the RMS noise calculated with
this method in fact includes the noise generated by the numerical algorithms.
This further degrades the accuracy of the results obtained by this method. This
method has one advantage when compared with the frequency domain methods
discussed above: it is not restricted to linear time-invariant circuits, or to
nonlinear circuits with a large signal periodic excitation. In theory, it is
applicable to the general class of nonlinear dynamic circuits with any kind of
excitation.

Our method, unlike the frequency domain methods, is not restricted to linear 
time-invariant or nonlinear circuits with a large signal periodic excitation. Our 
time-domain noise simulation method is based on the results from the theory 
of stochastic differential equations. There are no pseudo-random number gen- 
erators involved in the simulation, therefore the problems associated with them 
do not exist. The simulation of the average waveforms (without noise in the 
circuit) and the simulation of noise are separated, even though they are done 
concurrently. Thus, the numerical noise problem that arises in Monte Carlo 
methods is avoided. Our method is capable of calculating variances and
covariances (that is, the covariance matrix) for the noise content in the node voltages
and other circuit variables in a circuit as a function of time. Furthermore,
correlations between circuit variables at different time points can also be calculated.
Finally, the implementation of our method fits naturally into a circuit simulator 
(such as SPICE) which is capable of doing time-domain transient simulations. 
Noise simulation is done along with the transient simulation over the time inter- 
val specified by the user. 

3. Noise models 

The electrical noise sources in passive elements and integrated-circuit devices 
have been investigated extensively, and appropriate models have been derived 
[1,8]. Traditionally, these noise models are presented as stationary noise sources 
in the small-signal equivalent (at an operating point) circuits of the devices [1]. 
In this section, we describe the adaptation of these noise models for use in our 
time-domain noise simulation method. In our method, the noise sources are 
inserted in the large-signal models of the integrated-circuit devices and they are, 
in general, non-stationary. In Section 3.1, the adaptation of shot, thermal and 
flicker noise models for resistors and junction diodes will be described. The 
noise models for these two simple devices are representative of noise models 
for all other integrated-circuit devices such as BJTs and MOSFETs, because all 
kinds of noise we consider (shot, thermal and flicker noise) exist in these devices 
[16]. The noise source models we use in our method are adapted from [1] and 
[8]. As it will become clear in Section 4, our noise simulation method requires 
that noise sources are white. The thermal and shot noise sources are modeled 
as white noise sources, hence they can be directly included in the simulation. 
However, the flicker noise sources can not be included in the simulation as they 
are. The inclusion of flicker noise sources into the noise simulation method will 
be described in Section 3.2. 
3.1 Shot, thermal and flicker noise models

3.1.1 Resistors. Monolithic and thin-film resistors display thermal 
noise. The thermal noise in a resistor can be modeled by a white Gaussian noise 
current source with intensity

$$I_{thermal} = \sqrt{2kT/R} \tag{1}$$

where k is Boltzmann’s constant, T is the absolute temperature and R is the 
resistance [1]. The thermal noise source associated with a resistor is a stationary 
white noise process, assuming that the resistance value is a constant as a function 
of time. The intensity of a stationary white Gaussian noise process is equal to 
the square root of the power spectral density. For a stationary white Gaussian 
noise process, the power spectral density (a function of frequency) is a constant 
on the entire real axis. 

3.1.2 Junction diodes. The series resistance $r_s$, in the model of a
junction diode [1], is a physical resistor due to the resistivity of silicon, hence it
exhibits thermal noise. The thermal noise in $r_s$ can be modeled as in Section 3.1.1.
The junction exhibits shot noise which is associated with the current flow through the
diode. The intensity of the shot noise current, which is white Gaussian, is given
by

$$I_{shot} = \sqrt{q I_D(t)} \tag{2}$$

where $q$ is the electronic charge ($1.6 \times 10^{-19}$ C) and $I_D(t)$ is the noiseless diode
current. Note that, in this case, intensity is a function of time, hence this white
noise source is not stationary. The square of the time-varying intensity for a
non-stationary white noise source as above can be thought to be the time-varying
power spectral density, which is a constant (as a function of frequency) on the
entire real axis. During nonlinear operation, the current through the diode shows
variations as a function of time, and so does the intensity. In this way, shot noise
associated with a time-varying current is modeled as a non-stationary white
Gaussian noise, which is also the case for thermal noise associated with a
time-varying resistance. (1) is also valid for a time-varying resistance [16]. The
flicker noise source in a diode is modeled by a non-stationary noise process
which has a time-varying power spectral density given by

$$S_{flicker}(f, t) = K_F \frac{I_D(t)^a}{f} \tag{3}$$

where $K_F$ is a constant for a particular device, $a$ is a constant in the range 0.5
to 2 and $f$ is the frequency. This noise source can not be included in the noise
simulation directly, because it is not white (i.e. the time-varying power spectral
density is not a constant as a function of frequency). A way of synthesizing this
source from white noise sources will be discussed in Section 3.2.
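The white-noise intensities can be evaluated numerically as a quick sanity check. The sketch below assumes the two-sided forms $\sqrt{2kT/R}$ for thermal noise and $\sqrt{q I_D(t)}$ for shot noise; the resistor value, temperature, and the sinusoidally varying diode current are made-up inputs, and the function names are illustrative.

```python
import math

K_BOLTZMANN = 1.380649e-23    # J/K
Q_ELECTRON = 1.602176634e-19  # C

def thermal_intensity(resistance_ohm, temperature_k=300.0):
    """Intensity of the thermal noise current source, sqrt(2kT/R)."""
    return math.sqrt(2.0 * K_BOLTZMANN * temperature_k / resistance_ohm)

def shot_intensity(diode_current_a):
    """Intensity of the shot noise current source, sqrt(q * I_D)."""
    return math.sqrt(Q_ELECTRON * diode_current_a)

def diode_current(t, i_dc=1e-3, i_ac=0.5e-3, freq_hz=1e6):
    """Hypothetical large-signal diode current that stays positive."""
    return i_dc + i_ac * math.sin(2.0 * math.pi * freq_hz * t)

times = [k * 1e-7 for k in range(11)]
shot_over_time = [shot_intensity(diode_current(t)) for t in times]
print(thermal_intensity(1e3), shot_over_time)
```

The thermal intensity is a single number because the resistance is constant, whereas the shot intensity is a waveform that follows the large-signal current, which is the non-stationarity described above.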
3.2 Flicker noise sources



In our noise simulation method, only white noise sources are allowed. Flicker
noise sources have a power spectral density which is not a constant as a function
of frequency. The natural way to include flicker noise sources into simulation is,
somehow, to synthesize them using white noise sources. A promising approach
for $1/f$ (flicker) noise generation is to use the summation of Lorentzian spectra
which is defined by (4) [9]. It has been shown that a constant distribution of 1.4
poles per decade gives a $1/f$ spectrum with less than 1% error [9]. A sum of $N$
Lorentzian spectra is given by



$$S(f) = \frac{2\sigma^2}{\pi} \sum_{h=1}^{N} \frac{\phi_h}{\phi_h^2 + f^2} \tag{4}$$



where the $\phi_h$ designate the pole-frequencies and $f$ is the frequency. It has been
shown in [9] that $N = 20$ poles uniformly distributed over 14 decades are
sufficient to generate $1/f$ noise over 10 decades with a maximum error less than
1%. Each Lorentzian spectrum in the summation in (4) can be easily obtained
by using the thermal noise generator of a resistor $R_h$ connected in parallel to a
capacitance $C_h = C$, and their sum can be achieved by putting $N$ of such $R_h$-$C_h$
groups (Figure 1) in series [9].

Figure 1. Noise Synthesizing Circuit (groups labeled R1 through RN).

In the noise simulation, a flicker noise source in
the model of an integrated-circuit device is built by using the circuit in Figure 1
with an ideal voltage-controlled current source. This is illustrated in Figure 2.
The voltage-controlled current source is connected between the two nodes of a
device where the flicker noise source is modeled. The spectral density of the
noise obtained from the circuit in Figure 1 is approximately

S{f) = 2cp-/nf 



(5) 
where $\sigma^2 = kT/(2C)$. This spectral density is time-invariant. The flicker noise
model given in Section 3.1.2 requires a time-varying spectral density. This
is achieved by having a time-varying transconductance $g(t)$ for the
voltage-controlled current source in Figure 2. For instance, for a diode, we require that
the flicker noise source spectral density is in the form given by (3). This is
assured by choosing $g(t)$ so that $g(t)^2 S(f)$, the spectral density of the current
delivered by the controlled source, equals (3).

Figure 2. Flicker Current Noise Source Synthesis.
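How well a ladder of Lorentzians approximates a $1/f$ spectrum can be explored numerically. The sketch below is an illustration under stated assumptions: it uses unnormalized Lorentzian terms $\phi_h / (\phi_h^2 + f^2)$ (any constant prefactor is left out), places 20 poles at 1.4 poles per decade starting from an arbitrary 0.01 Hz, and compares $f \cdot S(f)$ with the constant that an infinitely long log-uniform ladder would give on average.

```python
import math

def lorentzian_sum(f, poles):
    """Sum of unnormalized Lorentzian spectra evaluated at frequency f."""
    return sum(p / (p * p + f * f) for p in poles)

def log_spaced_poles(first_pole_hz, poles_per_decade, count):
    """Pole frequencies placed uniformly on a logarithmic axis."""
    return [first_pole_hz * 10.0 ** (h / poles_per_decade) for h in range(count)]

POLES_PER_DECADE = 1.4
poles = log_spaced_poles(1e-2, POLES_PER_DECADE, 20)

# Average of f * S(f) for an infinite ladder: pi / (2 * pole spacing in ln units).
expected = math.pi * POLES_PER_DECADE / (2.0 * math.log(10.0))

# Sample eight decades that stay well inside the range covered by the poles.
freqs = [10.0 ** (1.0 + 8.0 * k / 400) for k in range(401)]
worst = max(abs(f * lorentzian_sum(f, poles) / expected - 1.0) for f in freqs)
print(worst)
```

A flat $f \cdot S(f)$ means the sum behaves like $1/f$; `worst` reports the largest relative ripple over the sampled band, and moving the band toward either end of the pole range shows where the approximation breaks down.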



4. Development of the simulation method 

The noise simulation method will be described assuming that modified nodal 
analysis (MNA) [10] is used for the formulation of circuit equations. MNA 
is the method for circuit equation formulation in most of the circuit simulators 
(such as SPICE) available. Translation of the noise simulation method into other 
ways of circuit equation formulation is straightforward. 

4.1 Derivation of the stochastic differential equation for noise from MNA
formulation of the nonlinear circuit equations

The MNA equations for any circuit, without the noise sources, can be written
compactly as

$$F(x, \dot{x}, t) = 0, \qquad x(0) = x_0 \tag{7}$$

where $x$ is the vector of the circuit variables with dimension $n$, $\dot{x}$ is the time
derivative of $x$, $t$ is time and $F$ is mapping $x$, $\dot{x}$ and $t$ into a vector of real numbers
of dimension $n$. The time dependence of $x$ and $\dot{x}$ will not be written explicitly
for notational simplicity. In MNA, the circuit variables consist of node voltages
and some branch currents, e.g. currents through inductors and voltage sources.
The circuit equations consist of the node equations (KCL) and branch equations
of the elements for which branch currents are included in the circuit variables
vector. Under some rather mild conditions (which are satisfied by well modeled
circuits) on the continuity and differentiability of $F$, it can be proven that there
exists a unique solution to (7) assuming that a fixed initial value $x(0) = x_0$ is
given [10]. Let $x_s$ be the solution to (7). Transient analysis in circuit
simulators solves for $x_s$ using numerical methods for ordinary differential equations
(ODEs) [10]. The initial value vector is obtained by a DC analysis of the circuit
before the transient simulation is started. For a circuit, there may be several
different DC operating-points.
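As a minimal illustration of how a transient analysis integrates such an equation, the sketch below applies backward Euler to a single-node circuit: a current source driving a parallel resistor and capacitor, so that $F(v, \dot{v}, t) = C\dot{v} + v/R - i(t) = 0$. The circuit, the element values, and the function name are assumptions made for the example; because this circuit is linear the implicit update has a closed form, whereas a nonlinear circuit would need a Newton iteration at every time step.

```python
def backward_euler_rc(i_source, r_ohm, c_farad, v0, dt, steps):
    """Integrate C*dv/dt + v/R - i_source(t) = 0 from v(0) = v0.

    Returns the list of node voltages at t = 0, dt, 2*dt, ..."""
    v, trace = v0, [v0]
    for k in range(1, steps + 1):
        t = k * dt
        # Backward Euler: C*(v_new - v)/dt + v_new/R - i_source(t) = 0
        v = (c_farad * v / dt + i_source(t)) / (c_farad / dt + 1.0 / r_ohm)
        trace.append(v)
    return trace

# 1 mA step into 1 kOhm in parallel with 1 nF (time constant 1 microsecond).
trace = backward_euler_rc(lambda t: 1e-3, 1e3, 1e-9, 0.0, 1e-8, 2000)
print(trace[-1])
```

The initial value `v0` plays the role of the DC operating point obtained before the transient simulation starts.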

The first-order Taylor expansion of $F$ around $x_s$ is expressed as

$$F(x,\dot{x},t) \approx F(x_s,\dot{x}_s,t) + \left.\frac{\partial F}{\partial x}\right|_{x=x_s,\,\dot{x}=\dot{x}_s}(x - x_s) + \left.\frac{\partial F}{\partial \dot{x}}\right|_{x=x_s,\,\dot{x}=\dot{x}_s}(\dot{x} - \dot{x}_s) \tag{8}$$

which will be used later.

If the noise sources are included in the circuit, the MNA formulation of the
circuit equations can be written as

$$F(x,\dot{x},t) + B(x,t)\,v = 0, \qquad x(0) = x_0 \tag{9}$$

where $B(x,t)$ is an $n \times p$ matrix, the entries of which are a function of $x$,
and $v$ is a vector of $p$ standard white Gaussian stochastic processes. A
one-dimensional standard Gaussian white noise is a stationary Gaussian process
$\xi(t)$, for $-\infty < t < \infty$, with mean $E[\xi(t)] = 0$ and a constant spectral
density on the entire real axis. The autocorrelation function of $\xi(t)$ is given
by $E[\xi(t+\tau)\,\xi(t)] = \delta(\tau)$, where $\delta$ is Dirac's delta function [11].

The white Gaussian noise $\xi(t)$ is a very useful mathematical idealization for
describing random influences that fluctuate rapidly and hence are virtually
uncorrelated for different instants of time. A white Gaussian noise model is
appropriate for thermal and shot noise in integrated circuits [1]. Flicker noise
sources are taken care of in the way described in Section 3.2. $v$ in (9) is
simply a combination of $p$ independent one-dimensional white Gaussian noise
processes as defined above. These noise processes correspond to the current
noise sources which are included in the models of the integrated-circuit
devices. Since the noise models for the integrated-circuit devices are to be
employed here in the context of an MNA circuit simulator (SPICE), noise sources
in the devices are all modeled as uncorrelated current sources.

$B(x,t)$, in (9), contains the intensities, as described in Section 3.1, for the
white noise sources in $v$. The intensities for these noise sources are, in
general, a function of time (not a constant). Because of intensity variations,
these noise sources are not stationary. Thus, the non-stationarity of the noise
sources in the circuit is captured in $B(x,t)$. Every column in $B(x,t)$
corresponds to a noise source in $v$ and has either one or two nonzero entries
[16].

(9) is a system of nonlinear stochastic differential equations (SDEs) where
the forcing is an irregular stochastic process (white noise). SDEs of this kind
require fundamentally different and complex methods of analysis and numerical
solution [12]. Fortunately, one characteristic of our problem helps us simplify
the numerical solution of (9): the noise content in the signals in any useful
circuit is, almost always, much smaller than the signal itself.

Let $x_{sn}$ be the solution of (9). $x_{sn}$ is a vector of stochastic processes,
since it is the solution of the circuit equations with the noise sources
included, and satisfies

$$F(x_{sn},\dot{x}_{sn},t) + B(x_{sn},t)\,v = 0, \qquad x_{sn}(0) = x_0 + x_{noise,0} \tag{10}$$

where $x_0$ is deterministic, and $x_{noise,0}$ is a vector of $n$ zero-mean random
variables. We use (8) in (10) to approximate $F(x_{sn},\dot{x}_{sn},t)$, and obtain

$$F(x_s,\dot{x}_s,t) + \left.\frac{\partial F}{\partial x}\right|_{x=x_s,\,\dot{x}=\dot{x}_s}(x_{sn} - x_s) + \left.\frac{\partial F}{\partial \dot{x}}\right|_{x=x_s,\,\dot{x}=\dot{x}_s}(\dot{x}_{sn} - \dot{x}_s) + B(x_{sn},t)\,v \approx 0, \qquad x_{sn}(0) = x_0 + x_{noise,0}. \tag{11}$$

Define

$$x_{noise} = x_{sn} - x_s. \tag{12}$$

$x_{noise}$ is the difference between the solutions of the circuit equations, with
and without the noise sources. In other words, $x_{noise}$ is the noise content in
$x_{sn}$. $x_{noise}$ is much smaller than $x_s$, which justifies the above
approximation.

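For notational simplicity, define

$$A(t) = \left.\frac{\partial F}{\partial x}(x,\dot{x},t)\right|_{x=x_s,\,\dot{x}=\dot{x}_s}, \qquad C(t) = \left.\frac{\partial F}{\partial \dot{x}}(x,\dot{x},t)\right|_{x=x_s,\,\dot{x}=\dot{x}_s} \tag{13}$$

where $A(t)$ and $C(t)$ are $n \times n$ matrices with time-dependent entries.
Furthermore, we approximate

$$B(x_{sn},t) \approx B(x_s,t) \tag{14}$$

and define

$$B(t) = B(x_s,t). \tag{15}$$

If (12), (13), (14) and (15) are substituted in (11) we obtain

$$F(x_s,\dot{x}_s,t) + A(t)\,x_{noise} + C(t)\,\dot{x}_{noise} + B(t)\,v = 0, \qquad x_{noise}(0) = x_0 + x_{noise,0} - x_s(0). \tag{16}$$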
Since $x_s$ is the solution of (7) we have

$$F(x_s,\dot{x}_s,t) = 0, \qquad x_s(0) = x_0 \tag{17}$$

and if we substitute (17) in (16), we obtain

$$A(t)\,x_{noise} + C(t)\,\dot{x}_{noise} + B(t)\,v = 0, \qquad x_{noise}(0) = x_{noise,0}. \tag{18}$$

(18) is a linear SDE [11] in $x_{noise}$ with time-varying coefficient matrices.
$A(t)$, $B(t)$ and $C(t)$ are functions of $x_s$, and they do not depend on
$x_{noise}$. The solution of this equation will be discussed in the next four
subsections.


4.2 Transformation of the stochastic differential 
equation for noise into state-equation form 

To make use of some of the results from the theory of SDEs, (18) will be put
into the form

$$\dot{y} = E(t)\,y + F(t)\,v, \qquad y(0) = y_0. \tag{19}$$

If $C(t)$ is a full-rank matrix, this can be easily done by premultiplying both
sides of (18) by the inverse of $C(t)$. However, this is not true in general;
$C(t)$ may have zero rows and columns. For instance, if a circuit variable is a
node voltage, and if this node does not have any capacitors connected to it in
the circuit, then all of the entries in the column of $C(t)$ corresponding to
this circuit variable will be zero for all $t$. Also, the node equation (KCL)
corresponding to this node will not contain any time-derivatives, hence the row
of $C(t)$ corresponding to this node equation will be zero for all $t$. Some of
the rows and columns of $C(t)$ are structurally zero, independent of $t$.
Moreover, the number of zero rows is equal to the number of zero columns. If we
reorder the variables in $x_{noise}$ in such a way that the zero columns of $C(t)$
are grouped at the right-hand side of the matrix, and reorder the equations in
such a way that the zero rows of $C(t)$ are grouped in the lower part of the
matrix, (18) becomes

$$\begin{bmatrix} A_{11}(t) & A_{12}(t) \\ A_{21}(t) & A_{22}(t) \end{bmatrix} \begin{bmatrix} x^1_{noise} \\ x^2_{noise} \end{bmatrix} + \begin{bmatrix} C_{11}(t) & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \dot{x}^1_{noise} \\ \dot{x}^2_{noise} \end{bmatrix} + \begin{bmatrix} B_1(t) \\ B_2(t) \end{bmatrix} v = 0, \qquad \begin{bmatrix} x^1_{noise}(0) \\ x^2_{noise}(0) \end{bmatrix} = \begin{bmatrix} x^1_{noise,0} \\ x^2_{noise,0} \end{bmatrix} \tag{20}$$

where $A_{11}(t)$ and $C_{11}(t)$ are $m \times m$, $A_{22}(t)$ is $k \times k$, $A_{12}(t)$ is
$m \times k$, $A_{21}(t)$ is $k \times m$, $B_1(t)$ is $m \times p$, $B_2(t)$ is $k \times p$,
$x^1_{noise}$ is an $m$-dimensional vector, $x^2_{noise}$ is a $k$-dimensional vector,
$m$ is the number of nonzero columns (rows) in $C(t)$ and $k$ is the number of zero
columns (rows). Naturally, $n = m + k$. Then, expanding (20) and performing
straightforward operations on this equation [16], we arrive at the SDE for noise
in the state equation form, which is given by

$$\dot{x}^1_{noise} = E(t)\,x^1_{noise} + F(t)\,v, \qquad x^1_{noise}(0) = x^1_{noise,0} \tag{21}$$

$$x^2_{noise} = D_1(t)\,x^1_{noise} + D_2(t)\,v, \qquad x^2_{noise,0} = D_1(0)\,x^1_{noise,0} + D_2(0)\,v(0) \tag{22}$$

with

$$\begin{bmatrix} x^1_{noise} \\ x^2_{noise} \end{bmatrix} = x_{noise} \ \text{(reordered)}. \tag{23}$$

Here, $E(t)$ is $m \times m$, $F(t)$ is $m \times p$, $D_1(t)$ is $k \times m$, $D_2(t)$ is
$k \times p$ and they are obtained from $A_{11}(t)$, $A_{12}(t)$, $A_{21}(t)$, $A_{22}(t)$,
$C_{11}(t)$, $B_1(t)$ and $B_2(t)$ by performing some matrix algebra operations [16].
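
The matrix algebra behind (21) and (22) can be sketched as a block elimination of (20). The sketch below assumes that $A_{22}(t)$ and $C_{11}(t)$ are nonsingular, which is not stated in the text; the procedure actually used is the one described in [16]. The second block row of (20) contains no derivatives, so it is solved for $x^2_{noise}$ and the result is substituted into the first block row:

```latex
\begin{aligned}
x^2_{noise} &= -A_{22}^{-1}\left(A_{21}\,x^1_{noise} + B_2\,v\right)
  \quad\Rightarrow\quad D_1 = -A_{22}^{-1}A_{21},\qquad D_2 = -A_{22}^{-1}B_2,\\
C_{11}\,\dot{x}^1_{noise} &= -\left(A_{11} - A_{12}A_{22}^{-1}A_{21}\right)x^1_{noise}
  - \left(B_1 - A_{12}A_{22}^{-1}B_2\right)v,\\
E &= -C_{11}^{-1}\left(A_{11} - A_{12}A_{22}^{-1}A_{21}\right),\qquad
F = -C_{11}^{-1}\left(B_1 - A_{12}A_{22}^{-1}B_2\right).
\end{aligned}
```

All matrices are evaluated at the same time $t$; the time argument is dropped for brevity.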



4.3 Solution of the stochastic differential 
equation for noise 

(21) is a linear differential equation where the forcing is an irregular
stochastic process, namely white noise. A mathematically rigorous treatment of
equations of this type requires a new theory. In 1951, Ito defined the Ito or
stochastic integral and in doing so put the theory of SDEs on a solid foundation
[11]. (21) is written symbolically as a linear SDE, but it is interpreted as an
integral equation with Ito or Stratonovich stochastic integrals [11]. The
solution of (21) obtained by the Stratonovich interpretation is equal to the one
obtained by the Ito interpretation, because it is a linear SDE in the narrow
sense [11]. A detailed explanation of Ito and Stratonovich stochastic integrals
and stochastic differential equations can be found in [11], [12] and [13]. In
the following development, we state and use some of the results from the theory
of SDEs. (21) is often written in the form

$$dx^1_{noise} = E(t)\,x^1_{noise}\,dt + F(t)\,dw, \qquad x^1_{noise}(0) = x^1_{noise,0} \tag{24}$$

where $w$ is a vector of $p$ independent one-dimensional Wiener processes. A
$p$-dimensional Wiener process can be defined as a process with independent and
stationary, $N(0, (t_1 - t_2) I_p)$-distributed increments $w(t_1) - w(t_2)$, with
initial value $w(0) = 0$. Here, $N(Mean, Cov)$ denotes the $p$-dimensional normal
distribution with expectation vector $Mean$ and covariance matrix $Cov$ [11]. A
Wiener process can be thought of as the "integral" of a white noise, or,
alternatively, white noise is the "derivative" of a Wiener process in the sense
of coincidence of the covariance functionals [11]. In our case, we have

$$w(t) = \int_0^t v(\tau)\,d\tau, \qquad v(t) = \dot{w}(t). \tag{25}$$
As with ordinary differential equations, the general solution of a linear SDE
can be found explicitly. The method of solution also involves an integrating
factor or, equivalently, a fundamental solution of an associated homogeneous
differential equation. The solution of (21) is given by

$$x^1_{noise}(t) = \Phi(t,t_0)\,x^1_{noise}(t_0) + \int_{t_0}^{t} \Phi(t,\tau)\,F(\tau)\,dw(\tau) \tag{26}$$

where $\Phi(t,\tau)$ is the matrix determined as a function of $t$ by the homogeneous
differential equation

$$\frac{d\Phi(t,\tau)}{dt} = E(t)\,\Phi(t,\tau), \qquad \Phi(\tau,\tau) = I_m. \tag{27}$$

(26) involves an Ito integral as opposed to a Riemann integral [11]. The integral
in (26) cannot be interpreted as an ordinary Riemann integral, because almost
all sample functions of $w(t)$ are of unbounded variation. Ito's definition of the
stochastic integral includes the ordinary Riemann integral as a special case [11].
If the functions $E(t)$ and $F(t)$ are "measurable" and bounded on the time interval
of interest, there exists a unique solution for every initial value
$x^1_{noise}(0)$ [11]. We are interested in the case where

$$x^1_{noise}(0) = x^1_{noise,0}. \tag{28}$$

In our problem, it is sufficient to find the probabilistic characteristics of
$x^1_{noise}$ as a function of $t$. In other words, we would like to determine the
mean and the covariance matrix of $x^1_{noise}$ as a function of time in the time
interval desired. If $x^1_{noise}$ is a Gaussian stochastic process, then it is
completely characterized by its mean and covariance function as a function of
time. Further explanation on this topic will be given in Section 4.5. If we
substitute (28) in (26) with $t_0 = 0$ we obtain

$$x^1_{noise}(t) = \Phi(t,0)\,x^1_{noise,0} + \int_0^t \Phi(t,\tau)\,F(\tau)\,dw(\tau). \tag{29}$$

If we take the expectation of both sides of (29) we get the mean of $x^1_{noise}$,
which is a function of $t$. Noting that $E[v(t)] = 0$ and $E[x^1_{noise,0}] = 0$, we get

$$m^1(t) = E[x^1_{noise}(t)] = 0. \tag{30}$$

Next, we would like to determine the correlation matrix of the components of
$x^1_{noise}$ as a function of $t$, which is given by

$$K^1(t) = E\left[x^1_{noise}(t)\,x^1_{noise}(t)^T\right]. \tag{31}$$

Consider the differential

$$d\left(x^1_{noise}\,{x^1_{noise}}^T\right) = x^1_{noise}\left(dx^1_{noise}\right)^T + \left(dx^1_{noise}\right){x^1_{noise}}^T + F(t)F(t)^T\,dt. \tag{32}$$
Notice that there is an extra term in (32) which would not be there if we were
using ordinary calculus instead of stochastic (Ito) calculus. This equation is
obtained from Ito's Theorem on stochastic differentials [11] using (24). We use
(24) to expand (32) and obtain

$$d\left(x^1_{noise}\,{x^1_{noise}}^T\right) = \left(E(t)\,x^1_{noise}\,{x^1_{noise}}^T + x^1_{noise}\,{x^1_{noise}}^T E(t)^T + F(t)F(t)^T\right)dt + x^1_{noise}\left(F(t)\,dw\right)^T + \left(F(t)\,dw\right){x^1_{noise}}^T. \tag{33}$$

If we take the expectation of both sides of this equation, noting that
$x^1_{noise}$ and $dw$ are uncorrelated and using (31), we get

$$\dot{K}^1(t) = E(t)\,K^1(t) + K^1(t)\,E(t)^T + F(t)F(t)^T \tag{34}$$

where $K^1(t)$ is the unique symmetric nonnegative-definite solution of the matrix
equation (34) with the initial value
$K^1(0) = E\left[x^1_{noise,0}\,{x^1_{noise,0}}^T\right] = K^1_0$. Calculation of the initial
value $K^1_0$ will be described later. The differential equation for
$K^1(t) = K^1(t)^T$, (34), satisfies the Lipschitz and boundedness conditions in the
time interval of interest, so that a unique solution exists [11]. (34) represents
(in view of the symmetry of $K^1(t)$) a system of $m(m+1)/2$ linear ordinary
differential equations. (34) can be solved for $K^1(t)$ using a numerical method
(such as Backward Euler) for the solution of ODEs.
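
As an illustration of solving (34) with Backward Euler, the following minimal sketch treats the scalar case m = 1 for a hypothetical test circuit: a resistor R in parallel with a capacitor C. The function name `rc_noise_variance`, the component values and the thermal-noise intensity (a current source of two-sided spectral density 2kT/R, so that F(t)F(t)^T = 2kT/(R C^2)) are assumptions made for this example and are not taken from the text. The variance should settle at the well-known value kT/C.

```python
# Backward Euler applied to the scalar (m = 1) case of the variance ODE (34):
#     dK/dt = 2*e*K + f2,   with e = E(t) and f2 = F(t)*F(t)^T.
# Hypothetical test circuit: resistor R in parallel with capacitor C.

BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K


def rc_noise_variance(R, C, T, h, steps, K0=0.0):
    """Return the list K[0..steps] of capacitor noise-voltage variances."""
    e = -1.0 / (R * C)                       # scalar E(t) of the RC circuit
    f2 = 2.0 * BOLTZMANN * T / (R * C * C)   # assumed thermal-noise intensity
    K = [K0]
    for _ in range(steps):
        # Implicit step: K_new = K_old + h*(2*e*K_new + f2), solved for K_new.
        K.append((K[-1] + h * f2) / (1.0 - 2.0 * h * e))
    return K


R, C, T = 1.0e3, 1.0e-12, 300.0
K = rc_noise_variance(R, C, T, h=0.1 * R * C, steps=200)
kT_over_C = BOLTZMANN * T / C
print(K[-1], kT_over_C)
```

For m > 1 the implicit step no longer reduces to a scalar division; it becomes a linear system in the m(m+1)/2 distinct entries of the symmetric matrix.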

$K^1(t)$ represents the noise correlation matrix of circuit variables as a
function of time. So, the information about the noise variances of circuit
variables, or the noise correlations between circuit variables at a given time
point, is contained in $K^1(t)$. In some problems, one might be interested in the
noise correlations of circuit variables at different time points, which can be
expressed as

$$K^1(t_1,t_2) = E\left[x^1_{noise}(t_1)\,x^1_{noise}(t_2)^T\right]. \tag{35}$$

In a similar way to the derivation of (34), one can derive

$$\frac{\partial K^1(t_1,t_2)}{\partial t_2} = K^1(t_1,t_2)\,E(t_2)^T \tag{36}$$

with the initial condition $K^1(t_1,t_1) = K^1(t_1)$ [13]. Integrating (36) at various
values of $t_1$, one can obtain a number of sections of the correlation function
$K^1(t_1,t_2)$ at $t_2 > t_1$. Then, $K^1(t_1,t_2)$ at $t_2 < t_1$ is determined by

$$K^1(t_1,t_2) = K^1(t_2,t_1)^T. \tag{37}$$
4.4 Calculation of the initial value for the linear
ODE for the covariance matrix of the
components of $x^1_{noise}$

In the previous section, we have derived a linear ODE, (34), for the
correlation matrix of $x^1_{noise}$. In order to be able to solve (34), we need to
know the initial value $K^1_0$. We set $K^1_0$ to the solution of the following
matrix equation in $P$

$$E(0)\,P + P\,E(0)^T + F(0)F(0)^T = 0. \tag{38}$$

The matrix equation (38) has a symmetric nonnegative-definite solution $P$, if
the equation $\dot{z} = E(0)\,z$ is asymptotically stable (that is, if all the
eigenvalues of $E(0)$ have negative real parts) [11]. (38) represents (in view of
the symmetry of $P$) a system of $m(m+1)/2$ linear equations.

It is interesting to analyze the special case of noise simulation when the circuit 
is linear time-invariant, or nonlinear dynamic with DC excitations [16]. In this 
case, noise simulation reduces to solving the linear equation system (38) [16]. 
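
To make the count of m(m+1)/2 unknowns in (38) concrete, here is a minimal pure-Python sketch that assembles one linear equation per upper-triangular entry of P and solves the system by Gaussian elimination. The helper names `solve_linear` and `solve_lyapunov` and the 2 x 2 example matrices are hypothetical, chosen only for illustration; this is not the sparse solver used in the implementation, and no special handling of a singular system is included.

```python
def solve_linear(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def solve_lyapunov(E, F):
    """Solve E P + P E^T + F F^T = 0 for the symmetric matrix P."""
    m, p = len(E), len(F[0])
    Q = [[sum(F[i][k] * F[j][k] for k in range(p)) for j in range(m)]
         for i in range(m)]
    # One unknown per entry on or above the diagonal: m(m+1)/2 in total.
    pairs = [(i, j) for i in range(m) for j in range(i, m)]
    index = {}
    for n, (i, j) in enumerate(pairs):
        index[(i, j)] = index[(j, i)] = n
    M = [[0.0] * len(pairs) for _ in pairs]
    rhs = []
    for row, (i, j) in enumerate(pairs):
        for k in range(m):
            M[row][index[(k, j)]] += E[i][k]  # (E P)_ij   = sum_k E_ik P_kj
            M[row][index[(i, k)]] += E[j][k]  # (P E^T)_ij = sum_k P_ik E_jk
        rhs.append(-Q[i][j])
    sol = solve_linear(M, rhs)
    return [[sol[index[(i, j)]] for j in range(m)] for i in range(m)]


# Hypothetical stable example: eigenvalues of E are -1 and -2.
E = [[-1.0, 0.5], [0.0, -2.0]]
F = [[1.0], [1.0]]
P = solve_lyapunov(E, F)
print(P)
```

For this example m = 2, so three equations are solved; the same construction applied at every time point of a Backward Euler step for (34) is what makes the cost grow quickly with m.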

4.5 The condition for $x^1_{noise}$ to be Gaussian

The noise in the circuit (solution of (21)) is a Gaussian stochastic process
if and only if the initial value $x^1_{noise,0}$ is normally distributed or
constant [11]. Up to this point, we have characterized the initial value
$x^1_{noise,0}$ as being an $m$-dimensional vector of zero-mean random variables
with the covariance matrix given by the solution of (38). Here, we restrict
$x^1_{noise,0}$ to be a vector of zero-mean normally distributed random variables
with the covariance matrix given by the solution of (38). With this restriction
on the initial value, $x^1_{noise}$ (solution of (21)) is a vector of Gaussian
stochastic processes, non-stationary in general, and it is completely
characterized by its mean, (30), and correlation function (given as the solution
of (34) and (36) as a function of time). For circuits with time-invariant
large-signal waveforms, $x^1_{noise}$ is a vector of stationary (in the strict
sense) Gaussian processes, completely characterized by its covariance matrix (a
constant function of time as given by the solution of (38)).

5. Implementation in SPICE 

The noise simulation method, along with the noise models described, was
implemented inside the circuit simulator SPICE3 [14]. Time-domain noise
simulation is performed along with the transient simulation in the time interval
specified by the user. The transient simulation in SPICE3 solves for $x_s$, which
is the solution of (7). The initial value vector $x(0) = x_0$ in (7) is obtained by
a DC analysis before the transient simulation is started. The numerical methods
for solving (7) subdivide the time interval $[0,T]$, in which the transient
simulation is performed, into a set of time points

$$t_0 = 0, \qquad t_R = T, \qquad t_{r+1} = t_r + h_{r+1}, \qquad r = 0, 1, \ldots, R \tag{39}$$

where the $h_{r+1}$ are the time steps. At each time point $t_r$ the numerical
methods compute an "approximation" $x_s[r]$ of the exact solution $x_s(t_r)$ [10].

The noise simulation (numerical solution of (34) and (36)) is performed
concurrently with the transient simulation. (34) represents a system of
$m(m+1)/2$ linear differential equations. We currently use the Backward Euler
scheme to discretize these equations in time.

At each time point $t_r$ of the noise simulation, after the transient simulation
routines have calculated $x_s[r]$, the matrices $A[r] = A(t_r)$, $C[r] = C(t_r)$ and
$B[r] = B(t_r)$, as defined by (13) and (15), are calculated using the values in
$x_s[r]$. These matrices are stored in sparse matrix data structures. The routines
for loading these matrices have been written for each device. The routines for
loading $B[r]$ contain the noise models for the devices. Then all the operations
described in Section 4.2 are performed to calculate $E[r] = E(t_r)$ and
$F[r] = F(t_r)$ from $A[r]$, $C[r]$ and $B[r]$. For efficiency reasons, the numerical
operations actually performed differ somewhat from what has been described. All
of these operations are performed using sparse matrix data structures. Then,
$E[r]$ and $F[r]$ are used to calculate $K^1[r] = K^1(t_r)$ in the discretized
solution of (34) with the Backward Euler scheme. This last operation requires
the solution of $m(m+1)/2$ simultaneous linear equations, because Backward Euler
is an implicit method [10]. Here, $m$ is, roughly, the number of nodes to which a
capacitor is connected. Simulations have shown that, for larger circuits, the
CPU time spent for this last operation at a time point heavily dominates the CPU
time required by the other operations. Most of the CPU time is used for solving
systems of linear equations. We currently use a general-purpose, direct-method
sparse matrix solver to solve systems of linear equations. With this
direct-method linear solver, the computational cost of noise simulation is still
high for large-scale circuits. Experiments with several circuits have shown that
significant speedup can be obtained by using a parallel iterative linear solver
(running on a CM-5) [17], especially for larger circuits. CPU times obtained
with this parallel iterative solver suggest that even a sequential version of
this iterative solver will reduce the computational cost of noise simulation
considerably when compared with the CPU times obtained with the direct solver we
currently use.

The operations described in the above paragraph are performed at every time
point. Upon completion, $x_s[r] = x_s(t_r)$, $r = 0, \ldots, R$ contains the mean
waveforms for the circuit variables as a function of time, which is the usual
SPICE transient simulation output, and $K^1[r] = K^1(t_r)$, $r = 0, \ldots, R$
contains the waveforms for the covariance matrix of the noise contents in the
circuit variables, as defined by (31), as a function of time, which is the noise
simulation output.
6. Noise simulation examples

In this section, we present two examples of noise simulation: a CMOS
ring-oscillator circuit and a BJT active mixer circuit. For both of these
circuits, we have included only the shot and thermal noise sources in the
simulation. One reason for this is that flicker noise has little effect on the
noise performance of these circuits. Secondly, including the flicker noise
sources increases the simulation time because of the extra nodes created for
flicker noise source synthesis.

6.1 CMOS ring-oscillator 

Three CMOS inverters loaded with 1 pF capacitors were connected in a
ring-oscillator configuration and a noise simulation was done. In Figure 3, the
mean and noise variance of one of the taps of this ring-oscillator can be seen.
As seen in Figure 3, the noise at one of the taps of the ring-oscillator is
non-stationary, that is, the noise variance is not constant as a function of
time. The noise variance is highest during low-to-high and high-to-low
transitions of the tap voltage. Ring-oscillator based VCOs and delay-lines are
used in many phase/delay-locked systems such as clock generators and clock
recovery circuits. Phase noise/jitter is a major concern in the design of such
systems. Behavioral models which capture noise effects, together with
behavioral simulation, are used to predict the phase noise/jitter performance of
these systems [15]. Our transistor-level noise simulator can be used to simulate
ring-oscillator VCOs and delay-lines to obtain the timing jitter at the outputs
of the delay cells (as well as the correlations between the jitters). This
information is then used in behavioral simulation [15].



Figure 3. Noise Simulation for the CMOS Ring-oscillator.

6.2 BJT Active Mixer

This circuit was obtained from industry sources. It contains 14 BJTs, 21
resistors, 5 capacitors, and 18 parasitic capacitors connected between some of
the nodes and ground. The LO (local oscillator) input is a sine-wave at 1.75 GHz
with an amplitude of 178 mV. The RF input is a sine-wave at 2 GHz with an
amplitude of 31.6 mV. Thus, the IF frequency is 250 MHz. 1/f noise sources are
not included in the simulation, because 1/f noise is rarely a factor at RF and
microwave frequencies [2]. This circuit was simulated to calculate the noise
variance at the output as a function of time (Figure 4: this waveform is
periodic with a period of 4 nsecs; the IF frequency is 250 MHz). The noise at
the output of this circuit is not stationary, because the signals applied to the
circuit are large enough to change the operating point. A noise analysis of
this circuit that assumes a small-signal equivalent circuit around a fixed
operating point does not give correct results. Such an analysis would predict
the noise at the output as stationary, i.e. a constant noise variance as a
function of time. The noise performance of a mixer circuit is commonly
characterized by its noise figure, which can be defined by [1]

$$NF = \frac{\text{total output noise power}}{\text{that part of output noise power due to the source resistance}}. \tag{40}$$

This definition is intended for circuits in small-signal operation. For such
circuits, noise figure is a scalar quantity. In our case, the noise at the
output of the mixer circuit changes as a function of time over one period. We
can generalize the noise figure definition such that noise figure is a quantity
that is a function of time. For the mixer circuit we have simulated, the noise
figure turns out to be a periodic function of time. To calculate the noise
figure as defined, we simulate the mixer circuit again to calculate the noise
variance at the output with all the noise sources turned off except the noise
source for the source resistance at the RF port. Then we can calculate the noise
figure as below, and the result is shown in Figure 5.

$$NF = 10 \log\left(\frac{\text{Average of Total Noise Variance}}{\text{Average of Noise Variance due to the source resistance only}}\right) \tag{41}$$



As observed in Figure 5, the maximum and minimum values of the noise figure
over one period differ by over 4 dB. This BJT mixer circuit has 65 nodes
(including the internal nodes for BJTs) which are connected to capacitors. The
noise simulation requires the solution of 2145 (65 x 66/2) simultaneous linear
equations at every time point, as explained in Section 5. The simulation (with
250 time points) took approximately 17 hours on a DECstation 5900/260 with our
current implementation (with the direct method linear solver).
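
The arithmetic of (41) can be sketched as follows. The function name `noise_figure_db`, the reading of "Average" as the arithmetic mean over the supplied time points, and the sample variance values are all assumptions made for this illustration; they are not simulation results from the mixer circuit.

```python
import math


def noise_figure_db(total_variance, source_variance):
    """Noise figure in dB in the spirit of (41): 10*log10 of the ratio of the
    average total output noise variance to the average output noise variance
    caused by the source resistance alone."""
    avg_total = sum(total_variance) / len(total_variance)
    avg_source = sum(source_variance) / len(source_variance)
    return 10.0 * math.log10(avg_total / avg_source)


# Hypothetical variance samples (V^2) at four time points of one period.
total = [4.0e-9, 6.0e-9, 4.0e-9, 2.0e-9]
source_only = [1.0e-9, 1.5e-9, 1.0e-9, 0.5e-9]
nf = noise_figure_db(total, source_only)
print(round(nf, 3))
```

The two input lists correspond to the two simulation runs described above: one with all noise sources active, and one with only the noise source of the source resistance at the RF port turned on.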




Figure 4. BJT Active Mixer - Noise Variance at the Output (horizontal axis: time in nsec).



7. Future Work 

We plan to compare the results from this noise simulator with noise measure- 
ments on actual circuits. The numerical methods used in the noise simulator will 
be modified to make it more efficient (as explained in Section 5). We will be us- 
ing our transistor-level noise simulator in the top-down constraint-driven design 
of a clock generator circuit for a RAMDAC. The noise simulator will be used to 
extract noise parameters in the behavioral modeling of phase/delay-locked loops 
[15]. 
Figure 5. BJT Active Mixer - Noise Figure.

Abstract

This paper describes PRIMA, an algorithm for generating provably passive reduced order N-port
models for RCL interconnect circuits. It is demonstrated that, in addition to requiring
macromodel stability, macromodel passivity is needed to guarantee the overall circuit stability
once the active and passive driver/load models are connected. PRIMA extends the block Arnoldi
technique to include guaranteed passivity. Moreover, it is empirically observed that the accuracy
is superior to existing block Arnoldi methods. While the same passivity extension is not possible
for MPVL, we observed comparable accuracy in the frequency domain for all examples considered.
Additionally, a path tracing algorithm is used to calculate the reduced order macromodel with the
utmost efficiency for generalized RLC interconnects.



1. Introduction 

As integrated circuits and systems are designed with smaller feature sizes 
and for faster operation, RLC interconnect effects have a more dominant im- 
pact on signal propagation than ever before. In addition, parasitic coupling 
effects and reduced power supply voltage levels make interconnect modeling 
increasingly important. Since these interconnect models can contain thousands 
of tightly coupled R-L-C components, reduced order macromodels are impera- 
tive [1] [2] [3] [4]. Ideally, a simulator would isolate the large linear portions of 
the circuit from the nonlinear elements (e.g., transistor models) and preprocess 
them into reduced order multiport macromodels. 

It is well known that an N-port can be fully represented by its admittance
parameters in the Laplace domain; however, the objective is to apply model order
reduction to produce low order rational approximations for each entry in Y(s)
(see Fig. 1). To find Y(s), voltage sources are connected to the ports and the
currents into the ports are measured. The voltage sources are the inputs to the
system and the port currents are the outputs. A single-input single-output
(SISO) N-port model approach would perform model order reduction on each term
Y_ij individually. Both Asymptotic Waveform Evaluation (AWE) [1] and Padé via
Lanczos (PVL) [2], which are Padé approximations, can perform SISO reduction by
matching 2q moments for a qth order approximation of each Y_ij term. The
Arnoldi Algorithm [4] can also be used to obtain SISO approximations; however,
it matches only q moments for a qth order approximation. MPVL (Matrix Padé via
Lanczos) [5] and Block Arnoldi [6] are multi-input multi-output (MIMO) versions
of PVL and Arnoldi respectively. In the block techniques, the system Modified
Nodal Analysis (MNA) matrices are directly reduced by matrix transformations.

Figure 1. The multiport representation of a linear circuit.

Regardless of the reduction method used in all of the approaches cited above,
the reduced order model of an RLC circuit can have unstable poles. It is always
possible to obtain an asymptotically stable model by simply discarding the
unstable poles; however, passivity is not guaranteed. In addition, discarding
unstable poles requires re-adjustment of the residues to improve the quality of
the approximation. Passivity uncertainty is problematic since even the test for
N-port passivity can be very costly for a large number of ports [7]. The
coordinate transformed Arnoldi Algorithm [8] was introduced as a remedy for the
instability problem, but it does not guarantee passivity. The passivity
extension of this stable Arnoldi algorithm was recently developed in [9];
however, its applicability is limited to RC circuits only. The PACT algorithm
[3] proposed a new direction for passive reduced-order models of RC circuits
based on congruence transformations. The same authors proposed Split Congruence
Transformations [10] for passive reductions of RLC circuits, producing
equivalent circuit realizations. In [10], however, the extra steps required to
split the transformation matrix can result in a decrease in accuracy and
efficiency. Moreover, the passivity proof is somewhat controversial, and we will
consider a more complete proof in this paper.
A passive system denotes a system that is incapable of generating energy, and
hence one that can only absorb energy from the sources used to excite it [11].
As we will show in Section 2.2, passivity is an important property to satisfy
because stable but non-passive macromodels can produce unstable systems when
connected to other stable, even passive, loads. A property in classical circuit
theory states that interconnections of stable systems may not necessarily be
stable, but (strictly) passive circuits are (asymptotically) stable, and
arbitrary interconnections of (strictly) passive circuits are (strictly) passive
and, therefore, (asymptotically) stable [12].

In this paper, we propose a Passive Reduced-order Interconnect Macromodeling
Algorithm, PRIMA, based on the Block Arnoldi Algorithm but with congruence
transformations that produce provably passive reduced order macromodels for
arbitrary RLC circuits. PRIMA has accuracy comparable to MPVL and superior to
Block Arnoldi. Furthermore, the block Arnoldi vectors are generated with the
utmost efficiency following the algorithms in RICE [13] that are used to
calculate moments. This includes efficient handling of interconnect trees and
meshes, as in RICE, but with renewed focus on efficient handling of large
problems with a huge number of mutual inductances.

2. Background 

To obtain the admittance matrix of a multiport, voltage sources are connected
to the ports. The multiport, along with these sources, constitutes the Modified
Nodal Analysis (MNA) equations:

$$C\,\dot{x}_n = -G\,x_n + B\,u_p, \qquad i_p = L^T x_n \tag{1}$$

The $i_p$ and $u_p$ vectors denote the port currents and voltages respectively and

$$G = \begin{bmatrix} N & E \\ -E^T & 0 \end{bmatrix}, \qquad C = \begin{bmatrix} Q & 0 \\ 0 & H \end{bmatrix}, \qquad x_n = \begin{bmatrix} v \\ i \end{bmatrix} \tag{2}$$

where $v$ and $i$ are the MNA variables corresponding to the node voltages and the
inductor and voltage source currents, respectively. The $n \times n$ matrices $G$
and $C$ represent the conductance and susceptance matrices (except that the rows
corresponding to the current variables are negated as in [8]). $N$, $Q$ and $H$
are the matrices containing the stamps for resistors, capacitors and inductors
respectively. $E$ consists of ones, minus ones and zeros, which represent the
current variables in KCL equations. Provided that the original N-port is
composed of passive linear elements only, $N$, $Q$, and $H$ are symmetric
nonnegative definite matrices. This implies $C$ is also symmetric and nonnegative
definite. Since this is an N-port formulation, whereby the only sources are the
voltage sources at the N port nodes, $B = L$. But we maintain the separate $B$ and
$L$ notation for the generality of the equations.

Returning to equation (1), following the notation in [2] we define

$$A = -G^{-1}C \quad \text{and} \quad R = G^{-1}B. \tag{3}$$

With unit voltages at the ports, taking the Laplace transformation of (1) and
solving for the port current variables, the y-parameter matrix is given as

$$Y(s) = L^T (I_n - sA)^{-1} R \tag{4}$$

where $I_n$ is the $n \times n$ identity matrix. It is apparent from (4) that the
eigenvalues of $A$ represent the inverses of the poles of $Y(s)$.
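
The "moments" matched by the reduction techniques discussed here can be read off from (4) by a series expansion about $s = 0$. The short derivation below is a standard Neumann-series argument added for illustration; the symbols $M_k$, $\lambda_i$ and $\rho(A)$ (the spectral radius of $A$) are notation introduced only for this sketch.

```latex
\begin{aligned}
Y(s) &= L^T (I_n - sA)^{-1} R
      = \sum_{k=0}^{\infty} \left(L^T A^k R\right) s^k ,
      \qquad |s| < 1/\rho(A), \\
M_k  &= L^T A^k R \quad \text{(the $k$-th block moment of } Y(s)\text{)}, \\
\det(I_n - sA) &= 0 \iff s = 1/\lambda_i
      \quad \text{for a nonzero eigenvalue } \lambda_i \text{ of } A .
\end{aligned}
```

The last line is the statement that the poles of $Y(s)$ are the reciprocals of the nonzero eigenvalues of $A$.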

Using any of the aforementioned model-order reduction techniques, we can 
find reduced order rational approximations to Yjk{s) terms, for all j,k < N. The 
reduced-order Y{s) can then be simulated along with other nonlinear and linear 
portions of the complete circuit using a simulator that employs either recursive 
convolution [14] or state-space realization [7], both of which have linear com- 
plexity. If the reduction is block, the reduced order multi-input multi-output 
circuit can also be realized using linear circuit elements. 

2.1 Block Arnoldi Algorithm 

The Block Arnoldi algorithm reduces the system matrix $A$ in (3) to a small
block upper Hessenberg matrix $H_q$. To do so requires an orthonormal basis, $X$,
for the corresponding Krylov space which satisfies the following:

$$\mathrm{colsp}(X) = Kr(A, R, \lfloor q/N \rfloor), \qquad X^T A X = H_q, \qquad X^T X = I_q \tag{5}$$

where $N$ is the number of ports, $\lfloor \cdot \rfloor$ denotes truncation to the
nearest integer towards zero, and $I_q$ is a $q \times q$ identity matrix. The
Krylov space is defined as

$$Kr(A, R, k) = \mathrm{colsp}[R, AR, A^2R, \ldots, A^kR]. \tag{6}$$

Finding the reduced order admittance matrix can be explained by a change of
variable,

$$x_n = X z_q \tag{7}$$

where $z_q$ is now the reduced order system variable, which reduces the number of
unknowns in the system ($q$ is generally much smaller than $n$). Substituting (7)
into (1), then multiplying the first equation by $X^T G^{-1}$ yields

$$H_q\,\dot{z}_q = z_q - X^T R\,u_p, \qquad i_p = L^T X z_q. \tag{8}$$

Therefore, in the Laplace domain,

$$\hat{Y}(s) = L^T X (I_q - sH_q)^{-1} X^T R \tag{9}$$

where $I_q$ is the $q \times q$ identity matrix.

The reduced order system equations and admittance matrix are given by (8)
and (9) respectively. The poles of the reduced order system are the reciprocal
eigenvalues of $H_q$. A complete pole/residue decomposition can be obtained by
eigendecomposing $H_q$. Using the information in [6], it can be shown that the
first $\lfloor q/N \rfloor$ moments of $\hat{Y}(s)$ in (9) match those of $Y(s)$ in (4).

2.2 Importance of Passivity 

It is always possible to come up with stable reduced order macromodels by
utilizing a number of heuristics; however, none of these tricks can be used to
obtain provably passive approximations. Moreover, in [7] it was shown that the
test for passivity of an AWE-reduced N-port macromodel is prohibitive in terms
of CPU run time. Fig. 2 is a numerical example generated in [7] that
demonstrates the passivity problem. $Y_1(s)$ in this figure represents a reduced
order transfer function which has all poles and zeros in the left half plane.
$Y_{dr}(s)$ represents a capacitor and resistor in parallel. If we drive this
circuit with a current source, it will oscillate at $2.5/\pi$ Hz. To show that
this is also a practical problem, we took a simple interconnect and connected
the load and the nonlinear driver as shown in the examples section in Fig. 5.
The interconnect is represented by a fifth order approximation obtained by
PVL [2]. The figure clearly shows the growing oscillations at the output
(instability) although all of the poles obtained from PVL were stable. A
Thevenin equivalent linear driver with a resistance of 2 ohms generates a
similar instability for this 2-port example.



[Figure 2 diagram: a driver admittance $Y_{dr}(s) = 0.06 + 0.056s$ connected to the
reduced-order admittance $Y_1(s)$.]

Figure 2. A non-passive system example demonstrating potential instability.
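• Connect voltage sources to the multiport and obtain the MNA matrices as in (2)
• Set $[b_1 | b_2 | \ldots | b_p] = B$ and $[l_1 | l_2 | \ldots | l_p] = L$
• Solve $GR = B$ for R
• $(X_0, K) = \mathrm{qr}(R)$; qr factorization of R
• Set $n = \mathrm{int}(q/N) + 1$
• For $k = 1, 2, \ldots, n$
      Set $V = C X_{k-1}$
      Solve $G X_k^{(0)} = V$ for $X_k^{(0)}$
      For $j = 1, \ldots, k$
          $H = X_{k-j}^T X_k^{(j-1)}$
          $X_k^{(j)} = X_k^{(j-1)} - X_{k-j} H$
      $(X_k, K) = \mathrm{qr}(X_k^{(k)})$; qr factorization of $X_k^{(k)}$
• Set $X = [X_0 | X_1 | \ldots | X_{k-1}]$ and truncate X so that it has q columns only.
• Compute $\tilde{C} = X^T C X$, $\tilde{G} = X^T G X$
• Find eigendecomposition of $\tilde{G}^{-1}\tilde{C}$: $\tilde{G}^{-1}\tilde{C} = S \Lambda S^{-1}$ [inversion of $\tilde{G}$ can be avoided]
      $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_q)$
• To find poles and residues for $\hat{Y}_{ij}(s)$:
      Solve $\tilde{G} w = X^T b_j$ for w
      Set $\mu = S^T X^T l_i$ and $\nu = S^{-1} w$
• Set $\hat{Y}(s) = \begin{bmatrix} \hat{Y}_{1,1} & \cdots & \hat{Y}_{1,p} \\ \vdots & & \vdots \\ \hat{Y}_{p,1} & \cdots & \hat{Y}_{p,p} \end{bmatrix}$

Figure 3. The passive reduction algorithm.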
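The pole/residue step of the algorithm can be illustrated with a short sketch. It assumes small random stand-ins `Gr`, `Cr`, `Br`, `Lr` for the reduced matrices (hypothetical values and names), and it uses the expansion $\hat{Y}_{ij}(s) = \sum_k \mu_k \nu_k / (1 + s\lambda_k)$, which is our own derivation from the eigendecomposition $\tilde{G}^{-1}\tilde{C} = S \Lambda S^{-1}$ rather than a formula quoted from the figure.

```python
import numpy as np

rng = np.random.default_rng(1)
q, p = 5, 2  # reduced order and number of ports (illustrative sizes)

# Hypothetical reduced matrices standing in for X^T G X, X^T C X, X^T B, X^T L.
M = rng.standard_normal((q, q))
Gr = M @ M.T + q * np.eye(q)
M = rng.standard_normal((q, q))
Cr = M @ M.T + np.eye(q)
Br = rng.standard_normal((q, p))
Lr = Br.copy()

# Eigendecomposition Gr^{-1} Cr = S diag(lam) S^{-1}.
lam, S = np.linalg.eig(np.linalg.solve(Gr, Cr))
poles = -1.0 / lam  # each term 1 / (1 + s*lam_k) has its pole at s = -1/lam_k

def term_weights(i, j):
    """Products mu_k * nu_k for entry (i, j), following the steps in the figure."""
    w = np.linalg.solve(Gr, Br[:, j])   # solve Gr w = b_j (reduced)
    mu = S.T @ Lr[:, i]                 # mu = S^T l_i (reduced)
    nu = np.linalg.solve(S, w)          # nu = S^{-1} w
    return mu * nu

def y_entry(i, j, s):
    """Entry (i, j) evaluated from the pole expansion."""
    return np.sum(term_weights(i, j) / (1.0 + s * lam))

def y_direct(s):
    """Reference: Lr^T (Gr + s Cr)^{-1} Br."""
    return Lr.T @ np.linalg.solve(Gr + s * Cr, Br)
```

The expansion and the direct solve should agree at any s that is not a pole.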



3. PRIMA: Passive Reduced-order Interconnect 
Macromodeling Algorithm 

The Block Arnoldi Algorithm is employed in PRIMA to generate the orthonormal
basis for a congruence transformation matrix. After $\lfloor q/N \rfloor + 1$ (the
extra step is not necessary when $q/N$ is an integer) iterations of PRIMA, the
$n \times q$ matrix X is found such that:

$$\mathrm{colsp}(X) = Kr(A, R, \lfloor q/N \rfloor), \qquad X^T X = I_q \tag{10}$$
In the classical Arnoldi approach [4], the reduced order $\hat{Y}(s)$ is calculated using
the $q \times q$ upper Hessenberg matrix in (5) as shown in (9):

$$\hat{Y}(s) = L^T X (I_q - sH_q)^{-1} X^T R \tag{11}$$

In our variation, the conductance and susceptance matrices are directly reduced
so that passivity is preserved during reduction. Applying the change of variable
$x_n = X_{(n \times q)} \tilde{x}_q$ in (1), and multiplying the first row by $X^T$ from (10) yields

$$(X^T C X)\,\dot{\tilde{x}}_q = -(X^T G X)\,\tilde{x}_q + X^T B u_p, \qquad i_p = L^T X \tilde{x}_q \tag{12}$$

The reduced order MNA matrices are, therefore,

$$\tilde{C} = X^T C X, \quad \tilde{G} = X^T G X, \quad \tilde{B} = X^T B, \quad \tilde{L} = X^T L \tag{13}$$

These types of transformations are known as congruence transformations.
Congruence transformations were first introduced by [3] for order reduction of
circuits. From (12) and (13), the reduced $Y(s)$, namely $\hat{Y}(s)$, is now

$$\hat{Y}(s) = \tilde{L}^T (\tilde{G} + s\tilde{C})^{-1} \tilde{B} \tag{14}$$

Since the sizes of $\tilde{G}$ and $\tilde{C}$ are typically very small, it is easy to find the poles and
zeros of $\hat{Y}(s)$ by eigendecomposition. The complete algorithm is given in Fig. 3.
It employs the Block Arnoldi Algorithm using modified Gram-Schmidt
orthogonalization [6], which is mathematically equivalent to the ordinary
Gram-Schmidt process, but behaves better numerically [15]. In addition, it is
possible to avoid the inversion of $\tilde{G}$ to find the poles and residues by using a
generalized eigendecomposition. In this case, the computation of $\tilde{G}^{-1}\tilde{B}$ can be
avoided by using (31) and replacing it by $X^T R$. It is observed that this scheme
is numerically much better.
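A minimal sketch of the reduction described above follows. It assumes dense NumPy matrices and random symmetric positive definite stand-ins for G and C (hypothetical data, not an MNA stamp); `prima_reduce` is our own name. Each new block is orthogonalized against the earlier blocks and then QR-factorized, and the loop simply runs until at least q columns are available.

```python
import numpy as np

def prima_reduce(G, C, B, L, q):
    """Build an orthonormal block Krylov basis X and apply the congruence (13)."""
    X0, _ = np.linalg.qr(np.linalg.solve(G, B))    # orthonormalize R = G^{-1} B
    blocks = [X0]
    while sum(b.shape[1] for b in blocks) < q:
        V = np.linalg.solve(G, C @ blocks[-1])     # next block: G^{-1} C X_{k-1}
        for Xj in reversed(blocks):                # Gram-Schmidt against earlier blocks
            V = V - Xj @ (Xj.T @ V)
        Xk, _ = np.linalg.qr(V)                    # orthonormalize within the block
        blocks.append(Xk)
    X = np.hstack(blocks)[:, :q]                   # truncate to q columns
    return X.T @ G @ X, X.T @ C @ X, X.T @ B, X.T @ L, X

rng = np.random.default_rng(2)
n, N, q = 40, 2, 8
M = rng.standard_normal((n, n))
G = M @ M.T + n * np.eye(n)
M = rng.standard_normal((n, n))
C = M @ M.T + np.eye(n)
B = rng.standard_normal((n, N))
L = B.copy()

Gr, Cr, Br, Lr, X = prima_reduce(G, C, B, L, q)

def y_full(s):
    """Original admittance L^T (G + sC)^{-1} B."""
    return L.T @ np.linalg.solve(G + s * C, B)

def y_red(s):
    """Reduced admittance, as in (14)."""
    return Lr.T @ np.linalg.solve(Gr + s * Cr, Br)
```

Because R lies in the column space of X, the reduced model reproduces the original admittance exactly at s = 0 and closely for small |s|.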

The complexity of the algorithm to produce q poles for an N-port is slightly
less than that of AWE, PVL and MPVL. It requires one LU factorization (or its
path tracing equivalent, as explained in Section 4) of the G (MNA conductance)
matrix, which dominates all the other computational costs and is common to all
reduction techniques. However, to find q poles, only q backward-forward
substitutions are needed, whereas in MPVL, PVL and AWE, twice as many are
required. As in MPVL, there will be only one eigendecomposition to find the
poles and residues, whereas PVL requires a separate eigendecomposition for each
$Y_{ij}(s)$, since each has a different $T_q$. AWE will solve the different Hankel
matrices to get to the poles.

3.1 Preservation of Passivity 

If the system described by (1) and (2) is reduced by the transformations in
(13), it can be shown that the reduced system is always passive. In [16],
necessary and sufficient conditions for the system admittance matrix $\hat{Y}(s)$
(eqn. (14)) to be passive are:

1. $\hat{Y}(s^*) = \hat{Y}^*(s)$ for all complex s, where * is the complex conjugate
operator.

2. $\hat{Y}(s)$ is a positive matrix, that is, $z^{*T}(\hat{Y}(s) + \hat{Y}^{*T}(s))z \ge 0$ for all complex
values of s satisfying $\Re(s) > 0$ and for any complex vector z.

The second condition also implies the analyticity of $\hat{Y}(s)$ for $\Re(s) > 0$, since
$\hat{Y}(s)$ is a rational function of s (details in [16]). Therefore, the test of
analyticity is unnecessary.

Because the reduced matrices $\tilde{G}$, $\tilde{C}$, $\tilde{B}$, and $\tilde{L}$ are all real (the
transformation matrix, X, is real), condition 1 is automatically satisfied. To
show that condition 2 is satisfied, we first set $Y_h(s) = \hat{Y}(s) + \hat{Y}^{*T}(s)$ and use
the property $\tilde{B} = \tilde{L}$ (since B = L in our formulation when there are no sources
inside the N-port, $X^T B = X^T L$) and some algebra to obtain,

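$$z^{*T} Y_h(s) z = z^{*T}\left(\tilde{B}^T(\tilde{G} + s\tilde{C})^{-1}\tilde{B} + \tilde{B}^T(\tilde{G} + s^*\tilde{C})^{-T}\tilde{B}\right)z \tag{15}$$

$$\qquad = z^{*T}\tilde{B}^T(\tilde{G} + s\tilde{C})^{-1}\left[(\tilde{G} + s\tilde{C}) + (\tilde{G} + s^*\tilde{C})^T\right](\tilde{G} + s^*\tilde{C})^{-T}\tilde{B}z$$

Setting $w = (\tilde{G} + s^*\tilde{C})^{-T}\tilde{B}z$ and $s = \sigma + j\omega$ yields,

$$z^{*T} Y_h(s) z = w^{*T}\left[(\tilde{G} + (\sigma + j\omega)\tilde{C}) + (\tilde{G} + (\sigma - j\omega)\tilde{C})^T\right]w$$

$$\qquad = w^{*T}\left[\tilde{G} + \tilde{G}^T + \sigma(\tilde{C} + \tilde{C}^T)\right]w \tag{16}$$

$$\qquad = w^{*T}X^T\left[G + G^T + \sigma(C + C^T)\right]Xw$$

Similarly, let y = Xw to get

$$z^{*T} Y_h(s) z = y^{*T}\left[G + G^T + \sigma(C + C^T)\right]y \tag{17}$$

Since C is symmetric, $C + C^T = 2C$. C is known to be nonnegative definite
(since we negate the rows corresponding to current variables as in (2)), so

$$y^{*T}\sigma(C + C^T)y = 2\sigma y^{*T}Cy \ge 0 \tag{18}$$

for any complex vector y and $\sigma = \Re(s) > 0$. N (the resistor stamps) is a
symmetric nonnegative definite matrix, therefore

$$y^{*T}(G + G^T)y = y^{*T}\left(\begin{bmatrix} N & E \\ -E^T & 0 \end{bmatrix} + \begin{bmatrix} N & E \\ -E^T & 0 \end{bmatrix}^T\right)y = y^{*T}\begin{bmatrix} 2N & 0 \\ 0 & 0 \end{bmatrix}y \tag{19}$$

is also nonnegative definite for any complex vector y. From (17), (18), and (19),
it follows that the second passivity condition is satisfied.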
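The argument above can be checked numerically. The sketch below builds a hypothetical G with the block structure used in (19), a symmetric positive definite C, and B = L. Since the proof uses only that X is real, the sketch takes a random real X rather than a Krylov basis; all sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
nv, ni, p, q = 6, 3, 2, 4   # node voltages, branch currents, ports, reduced order
n = nv + ni

# G = [[N, E], [-E^T, 0]] with N symmetric nonnegative definite, as in (19).
M = rng.standard_normal((nv, nv))
Nmat = M @ M.T
E = rng.standard_normal((nv, ni))
G = np.block([[Nmat, E], [-E.T, np.zeros((ni, ni))]])

# C symmetric positive definite (a simplifying assumption for this sketch).
M = rng.standard_normal((n, n))
C = M @ M.T + np.eye(n)
B = rng.standard_normal((n, p))          # B = L in this formulation

# Any real X with orthonormal columns; here a random one.
X, _ = np.linalg.qr(rng.standard_normal((n, q)))
Gr, Cr, Br = X.T @ G @ X, X.T @ C @ X, X.T @ B

def min_eig_hermitian_part(s):
    """Smallest eigenvalue of Y_h(s) = Y(s) + Y(s)^{*T} for the reduced model."""
    Y = Br.T @ np.linalg.solve(Gr + s * Cr, Br)
    Yh = Y + Y.conj().T
    return np.linalg.eigvalsh(Yh).min()

# Sample points in the open right half plane.
samples = [complex(sig, om)
           for sig in (1e-3, 0.1, 1.0, 10.0)
           for om in (-50.0, -1.0, 0.0, 1.0, 50.0)]
worst = min(min_eig_hermitian_part(s) for s in samples)
```

A nonnegative `worst` (up to rounding) at every sampled point is consistent with condition 2; sampling is only a spot check, not a proof.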






3.2 Preservation of Moments 

The transformation in (13) preserves $\lfloor q/N \rfloor$ moments of the original system,
which is the same as the classical Block Arnoldi reduction and half of that in
MPVL. The proof is as follows. The exact (block) moments, $M_i$, of the circuit
are given as:

$$M_i = L^T A^i R \tag{20}$$

where $A = -G^{-1}C$, $R = G^{-1}B$, and G, C, B, L are the system matrices as
defined in (1).

Likewise, the moments of the PRIMA reduced order system are given by

$$\tilde{M}_i = \tilde{L}^T \tilde{A}^i \tilde{R} \tag{21}$$

where $\tilde{A} = -\tilde{G}^{-1}\tilde{C}$, $\tilde{R} = \tilde{G}^{-1}\tilde{B}$ and $\tilde{G}$, $\tilde{C}$, $\tilde{B}$, $\tilde{L}$ are as defined in (13).
Substitution of (13) in (21) yields:

$$\tilde{M}_i = L^T X\left[-(X^T G X)^{-1}(X^T C X)\right]^i (X^T G X)^{-1} X^T B \tag{22}$$



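It is shown in [6] that the Arnoldi algorithm yields

$$A^i R = X H_q^i X^T R. \tag{23}$$

Rearranging the terms and using the definitions from (13):

$$A A^{i-1} R = X H_q^i X^T R$$

$$-G^{-1} C A^{i-1} R = X H_q^i X^T R$$

$$-C A^{i-1} R = G X H_q^i X^T R$$

$$-X^T C A^{i-1} R = X^T G X H_q^i X^T R \tag{24}$$

$$-X (X^T G X)^{-1} X^T C A^{i-1} R = X H_q^i X^T R \tag{25}$$

Inserting (23) in (25) results in:

$$K A^{i-1} R = A^i R, \qquad 0 \le i < \lfloor q/N \rfloor \tag{26}$$

where

$$K = -X (X^T G X)^{-1} X^T C. \tag{27}$$

From (26), it can be shown by recursion that

$$K^i R = A^i R, \qquad 0 \le i < \lfloor q/N \rfloor \tag{28}$$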
Therefore, using (27) it follows that

$$X\left[-(X^T G X)^{-1}(X^T C X)\right]^i = K^i X \tag{29}$$

Replacing $X\left[-(X^T G X)^{-1}(X^T C X)\right]^i$ in (22) with $K^i X$ yields

$$\tilde{M}_i = L^T K^i X (X^T G X)^{-1} X^T B \tag{30}$$

Evaluating (26) when i = 0 gives

$$X (X^T G X)^{-1} X^T B = R \tag{31}$$

Then from (30) and (31),

$$\tilde{M}_i = L^T K^i R, \qquad 0 \le i < \lfloor q/N \rfloor \tag{32}$$

Finally, combining (32) and (28) with (20), it follows that

$$\tilde{M}_i = M_i, \qquad 0 \le i < \lfloor q/N \rfloor \tag{33}$$

Note that the number of poles in each entry of $\hat{Y}(s)$ is q, and we have matched
the first $\lfloor q/N \rfloor$ moments at all N ports, yielding a total of q moments. The number
of moments matched in PRIMA is, therefore, the same as that for the Block
Arnoldi algorithm and half as many as matched by MPVL.
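The moment-matching result can be spot-checked on a small example. The sketch below assumes random stand-in matrices and q = kN, and it forms the Krylov basis from explicit powers of A followed by a QR factorization, which is adequate only for small k (a production code would use the block Arnoldi recursion instead). L is deliberately chosen different from B so that only the one-sided matching shown above is exercised.

```python
import numpy as np

rng = np.random.default_rng(5)
n, N, k = 20, 2, 3          # unknowns, ports, number of Krylov blocks
q = k * N                   # reduced order, so that q/N = k is an integer

M = rng.standard_normal((n, n))
G = M @ M.T + n * np.eye(n)
M = rng.standard_normal((n, n))
C = M @ M.T + np.eye(n)
B = rng.standard_normal((n, N))
L = rng.standard_normal((n, N))   # L != B on purpose

A = -np.linalg.solve(G, C)        # A = -G^{-1} C
R = np.linalg.solve(G, B)         # R =  G^{-1} B

# Orthonormal basis of colsp[R, AR, ..., A^{k-1} R] via explicit powers and QR.
krylov = np.hstack([np.linalg.matrix_power(A, i) @ R for i in range(k)])
X, _ = np.linalg.qr(krylov)

# Congruence transformation (13) and the reduced A, R of (21).
Gr, Cr, Br, Lr = X.T @ G @ X, X.T @ C @ X, X.T @ B, X.T @ L
Ar = -np.linalg.solve(Gr, Cr)
Rr = np.linalg.solve(Gr, Br)

def moment(i):
    """Exact block moment (20)."""
    return L.T @ np.linalg.matrix_power(A, i) @ R

def moment_red(i):
    """Reduced block moment (21)."""
    return Lr.T @ np.linalg.matrix_power(Ar, i) @ Rr
```

For i = 0, ..., k-1 the two functions should return the same N x N block up to rounding, in line with (33).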



4. Integration of PRIMA within RICE 

For all of the model order reduction schemes, the LU decomposition of the 
MNA conductance matrix (G in (2)) dominates the run time. In [13], RICE 
(Rapid Interconnect Circuit Evaluation) was described as a general path tracing 
algorithm to obtain moments with optimal efficiency for interconnect trees and 
mesh structures. Using RICE to calculate moments, the explicit construction 
and inversion of G is avoided, and the moments are more accurate than those 
obtained via matrix factorization. 

The moments of the circuit can be obtained recursively from:

$$M_0 = G^{-1}B, \qquad M_k = G^{-1}CM_{k-1}, \quad k > 0 \tag{34}$$

where the matrices G, C, and B are as defined in (2). As shown in [1], this
can be viewed as recursive dc circuit solutions, when capacitors and inductors
are replaced by current and voltage sources respectively, with the values derived
from the columns of $CM_{k-1}$. The Krylov vectors, which can be viewed as well
conditioned moments, can be obtained from a very similar recursive scheme:
1. Obtain zeroth moment and orthonormalize it:

$$\text{Solve } M_0 \text{ from } GM_0 = B \tag{35}$$

$$X_0 = \mathrm{orth}(M_0) \tag{36}$$

2. Recursively obtain higher order Krylov vectors:

$$\text{Solve } M_k \text{ from } GM_k = CX_{k-1} \tag{37}$$

$$\tilde{X}_k = M_k - X_{k-1}(X_{k-1}^T M_k) - \cdots - X_0(X_0^T M_k) \tag{38}$$

$$X_k = \mathrm{orth}(\tilde{X}_k) \tag{39}$$

The "orth" operator can be implemented as a simple Gram-Schmidt
orthonormalization procedure. The space spanned by the block Krylov terms
$(X_k, X_{k-1}, \ldots, X_0)$ is called the Krylov space. Therefore, the Krylov vectors can
be obtained via a path tracing procedure using RICE-like routines to solve
equations (35) and (37).

The basis of the Krylov space constitutes the congruence transformation matrix,
X, in PRIMA. The reduced MNA matrices $\tilde{G}$ and $\tilde{C}$ are

$$\tilde{C} = X^T C X, \qquad \tilde{G} = X^T G X \tag{40}$$

Note, however, that the matrices CX and GX are obtained using RICE without
explicitly constructing C and G. The columns of CX are the values of current
and voltage sources that are used to replace capacitors and inductors at each
moment computation stage. This information is easily obtained during a path
trace [13]. The kth block of GX (i.e. $G\tilde{X}_k$) is a function of previous blocks of
GX and $CX_{k-1}$ since from (38),

$$G\tilde{X}_k = GM_k - GX_{k-1}(X_{k-1}^T M_k) - \cdots - GX_0(X_0^T M_k) \tag{41}$$

and using $GM_k = CX_{k-1}$,

$$G\tilde{X}_k = CX_{k-1} - GX_{k-1}(X_{k-1}^T M_k) - \cdots - GX_0(X_0^T M_k) \tag{42}$$
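The recursion (35)-(39) and the update (42) can be sketched as follows. The example uses dense random stand-in matrices (hypothetical data), replaces path tracing with ordinary linear solves, and implements "orth" with a QR factorization instead of an explicit Gram-Schmidt loop; these substitutions are ours. For the check, the products $GX_j$ are recomputed directly, whereas a path tracing implementation would reuse stored blocks.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, steps = 12, 2, 3

M = rng.standard_normal((n, n))
G = M @ M.T + n * np.eye(n)
M = rng.standard_normal((n, n))
C = M @ M.T + np.eye(n)
B = rng.standard_normal((n, N))

def orth(V):
    """Orthonormalize the columns of V (QR used in place of Gram-Schmidt)."""
    Q, _ = np.linalg.qr(V)
    return Q

X = [orth(np.linalg.solve(G, B))]              # (35) and (36)
max_gap = 0.0
for k in range(1, steps + 1):
    Mk = np.linalg.solve(G, C @ X[-1])         # (37): solve G M_k = C X_{k-1}
    Xt = Mk.copy()                             # will become X~_k of (38)
    GXt = C @ X[-1]                            # (42) starts from C X_{k-1}
    for Xj in X:
        coeff = Xj.T @ Mk
        Xt -= Xj @ coeff                       # (38): remove component along X_j
        GXt -= (G @ Xj) @ coeff                # (42): same update applied to G X_j
    # (42) should reproduce G X~_k without multiplying G by the new block.
    max_gap = max(max_gap, np.abs(G @ Xt - GXt).max())
    X.append(orth(Xt))                         # (39)

Xall = np.hstack(X)
```

After the loop, `max_gap` measures the largest discrepancy between (42) and a direct product with G, and `Xall` holds the accumulated orthonormal basis.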



5. Time-domain Simulation of Macromodels 

For a complete circuit simulation, the nonlinear elements should be simulated
along with the reduced order macromodels. There are two ways to include the
PRIMA macromodels into circuit simulators such as SPICE [17]. One approach
is in terms of the frequency domain y-parameters. Combining the nonlinear time
domain analysis in SPICE with the frequency dependent y-parameters requires
convolution of $O(T^2)$ complexity, where T is the number of simulation
time-points. For this reason, recursive convolution [14] and time-domain
y-parameter macromodels [7] were developed, where the complexity is linear with
the number of time-points. The second method is direct stamping (i.e. circuit
model realization). Since the reduction method we use is block, the reduced
matrices can be directly stamped into the SPICE MNA matrices. Noticing that the
reduced order q-variable system has the equations shown in (12) and (13), and
recognizing that it is possible to introduce $\tilde{x}_q$ as a circuit variable into the
MNA matrix, the direct stamps for the macromodel can be generated as below:



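[Equation (43): block MNA system for the directly stamped macromodel, with
unknown vector $(x_{NL}, u_p, i_p, \tilde{x}_q)$ and blocks comprising the stamps for
$f(x_{NL}, u_p)$, $I_N$, $-\tilde{L}^T$, $-\tilde{B}$ and $(\tilde{G} + \tilde{C}\,\frac{d}{dt})$.]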



In (43), $x_{NL}$ denotes the other variables of the circuit (other node voltages and
currents), $u_p$ and $i_p$ are the port voltages and currents respectively, and $\tilde{x}_q$
denotes the extra variables that are introduced by including the realized
macromodel in the circuit. Since $\tilde{C}$ is a symmetric and real matrix, it can be
diagonalized using singular value decomposition. In this case, all the
capacitance values will be real and positive, since they will be the singular
values of $\tilde{C}$. Note that it is also possible to come up with a realization scheme
similar to (43) starting from $\hat{Y}(s)$.



6. Results 

In this section, PRIMA is demonstrated and compared with other approaches. 
All reductions are done using RICE v5.0, a program which integrates the PRIMA 
algorithm with the RICE moment calculation routines. For the frequency do- 
main examples, the y-parameters are compared with the reduced order models 
from different reduction methods. Time domain results via recursive convolu- 
tion are obtained using a modified version of SPICE3f4 [18]. For all the exam- 
ples, the poles obtained via PRIMA were observed to be stable. 

6.1 Mesh Ground Plane 

With the ability to calculate a large number of poles accurately, PRIMA can 
be applied to analysis problems which include complex, high frequency re- 
sponses. One such application is the R-L mesh plane encountered in MCM and 
packaging problems. Since such a problem is strongly coupled, the L-matrix is 
dense and thereby destroys the matrix sparsity in a classical SPICE simulation. 
In this example, the ground plane is modeled by a 20x20 mesh, and each square 
is modeled as a resistor and an inductor. The coupling can be adjusted to make 
the inductance matrix sparse as described in [19]. 

There is an RL line over the ground plane that is terminated with a capacitor
load, as shown in Fig. 4. Also shown in the figure are time domain results
from PRIMA and HSPICE for various levels of L-matrix sparsity. The full matrix
response is also shown for PRIMA, but the HSPICE simulation would not
complete its run due to memory and run-time limits.

[Figure 4 plots: time-domain waveforms versus time (ns) from PRIMA and HSPICE;
panel titles include "plane_0, 2 poles" and "plane_4".]

Circuit complexity

Circuit name    # of R    # of L    # of K
plane_0         859       858       0
plane_4         859       858       17,892
plane_full      859       858       183,693

Run time comparisons

Circuit name    HSPICE         PRIMA         RecConv
plane_0         39.97 secs     0.25 secs     0.03 secs
plane_4         17,343 secs    1.23 secs     0.05 secs
plane_full      can not run    11.73 secs    0.05 secs

Figure 4. Mesh ground plane example.

This circuit is a worst-case interconnect topology for a path tracing algorithm
[13] (all loops); however, RICE v5, our path tracing implementation of PRIMA,
showed excellent speed-up over HSPICE, a commercial circuit simulation tool.
The table in this figure also includes the time required for recursive convolution
of the reduced-order model in SPICE3f, denoted by RecConv.

6.2 Nonlinear Driver Driving a Transmission Line 

Fig. 5 shows a lossy transmission line represented by 40 lumped RLC sections 
and reduced to five poles using both PVL and PRIMA. Although all of the poles 
from PVL were stable (i.e. negative real parts), the overall PVL response was 
clearly unstable as shown in Fig. 5. The fifth order approximation from PRIMA 
is indistinguishable from the exact response, which was obtained by an HSPICE 
simulation for this example. 

6.3 Coupled Noise for a Two-bit Bus 

Next consider the two-bit bus driven by CMOS inverters in Fig. 6. One of the
drivers is switching while the other is quiet. The interconnect, consisting of 40
coupled RLC sections, is modeled as a 4-port and reduced by PRIMA. Transient
analysis is done using recursive convolution. The time domain waveforms at
the load end are compared for various orders of approximation. Since this is
a 4-port, an 8 pole approximation corresponds to matching only m0 and m1
generated by four different sources. The plot shows that in the time domain, even
the coupled noise can be accurately simulated using the 8 poles from PRIMA.
Although the interconnect inductance was exaggerated in this example to make
the approximation more difficult, it is observed that an 8th order approximation
is sufficient to capture the coupled noise from the active driver to the quiet load
end.

Figure 6. Waveform comparisons for a four port.

Exact              Reduced
Full Simulation    Simulated After Direct Realization    Simulated by Y-parameter based
17.78 sec          0.6 s. with 8 poles                   0.18 s. with 8 poles
                   3.98 s. with 16 poles                 0.28 s. with 16 poles
                   10.29 s. with 24 poles                0.32 s. with 24 poles

Table 1. Run time comparisons.

To compare direct realization with y-parameter based simulation (i.e. recursive
convolution here), the reduced order circuit (via PRIMA) is simulated using both
techniques. In the direct realization, $\tilde{C}$ is diagonalized to increase the speed.
The run times are given in Table 1. Although the circuit is relatively small
(i.e. G is only 300x300), the gain in using a PRIMA reduced macromodel and
y-parameter based simulation is about 50x over direct realization. For larger
circuits such as the mesh plane example, this gain is expected to be much larger.
Direct realization is inferior when the order of approximation gets bigger,
mainly because the dense matrix gets larger.

6.4 Six Coupled Transmission Lines 

The second example is a 12-port containing six coupled transmission lines
modeled by 40 coupled RLC sections. The input admittance ($Y_{11}(s)$), reduced
by Block Arnoldi, MPVL and PRIMA, is compared with the exact input
admittance in Fig. 7 using 48th order approximations in all cases. Block Arnoldi
captures the exact response up to 16 GHz, while MPVL and PRIMA match up
to 28 GHz. When the order of approximation is increased to 72 poles, it is
observed that the frequency spectrum is captured up to 60 GHz by MPVL and
PRIMA.



6.5 Large Coupled RLC Circuit 

The third example in Fig. 8 displays the responses for a 3-port composed of 
densely coupled RLC circuits. Approximations are done using 25 poles for the 
three methods. As can be observed from the figure, both PRIMA and MPVL 
capture the entire frequency spectrum. 

Figure 7. $Y_{11}(s)$ in the frequency domain for six coupled transmission lines.

Figure 8. 3-Port consisting of a large lumped RLC circuit.

7. Conclusions

This paper presented a novel algorithm for producing provably passive
macromodels for arbitrary RLC circuits. The method uses a Block Arnoldi algorithm
to generate the vectors needed for applying a transformation to the macromodel
MNA matrices. Empirical results show that PRIMA is comparable or superior in
accuracy to all other known reduction techniques, and superior in that it
guarantees the passivity that is critical for time domain analyses. The
implementation of PRIMA with path tracing algorithms from RICE enables extremely
accurate high frequency response approximations of enormous, complex RLC
circuits with excellent efficiency.
The PRIMA algorithm presented in this paper can be easily extended to
implement a number of heuristics such as moment shifting [13] and frequency
shifting [20]. However, these heuristics are unnecessary and merely increase
the complexity.
Abstract

This paper introduces a new circuit noise analysis and modeling method. The noise analysis
method computes an analytic expression of frequency, in rational form, which represents the
Padé approximation of the noise power spectral density. The approximation can be carried out
efficiently, to the required accuracy, using a variant of the PVL [1] or MPVL [2] algorithms.
The new method is significantly more efficient than traditional methods for noise computation
at numerous frequency points. In addition, it allows for a compact and cascadable modeling of
noise that can be used in system-level simulations.



1. Introduction 

Noise is a fundamental phenomenon in electronic circuits caused by the small 
fluctuations in currents and voltages that occur within the devices in the circuit. 
The fluctuations are due mainly to the discontinuous nature of electric charge. 
Determining the effects of noise is very important, as noise often represents the 
fundamental limit of circuit or system performance. 

Noise analysis algorithms for circuits in DC steady-state have been available
for a long time in programs such as SPICE [6]. The result of such programs is
the noise power over a range of frequencies in tabulated form. Circuit or
system designers typically reduce the information contained in the noise spectrum
to one single number such as the noise figure [3]. While such compact
representations offer good insight, and are very convenient for back-of-the-envelope
calculations, CAD tools at both the circuit and the system level can take
advantage of the more accurate and complete information available in the noise
spectrum, and, in return, offer more accurate analysis.

In this paper we introduce an algorithm that computes the noise power
spectral density as a closed-form rational expression. More specifically, the
algorithm computes the Padé approximation of the noise power spectral density
using the numerically robust and efficient Lanczos [4] method. The spectrum
is computed to the required accuracy over the frequency range of interest. This
method is significantly more efficient than repeatedly evaluating the noise power
over a fine frequency grid. However, the real advantage of the new approach
consists in the compact noise models that get produced. These compact noise
models with the spectral density specified as rational expressions can be
accepted by the algorithm as input noise sources. In other words, the algorithm can
consume its own output, thus lending itself to a hierarchical analysis
methodology.

For example, the circuit designer designs an amplifier, analyzes it, and pro- 
duces a high-level model, say, of its transfer function and of its output noise 
spectrum. The system designer uses models for all the components to perform, 
using the same noise analysis algorithm, a system-level simulation. No informa- 
tion is lost due to a too narrow interface between circuit-level and system-level 
models. 

The paper is organized as follows. In the next section we review the circuit
noise analysis problem. In Section 3 we reformulate the noise spectral density
expression in a form compatible with Padé via Lanczos (PVL) model reduction.
Section 4 discusses the application of the PVL algorithm to this particular
problem. Finally, in Section 5, we illustrate the noise analysis problem with a few
circuit examples and then present concluding remarks in Section 6.

2. Review of circuit noise analysis 

The principal mechanisms of noise in integrated circuits are [3]: thermal, 
shot, and flicker noise. Mathematically, device noise is modeled by stochastic 
processes [5]. Stochastic processes represent ensembles of functions of time, 
n(t), and are characterized in terms of statistical averages, such as the mean and 
autocorrelation, in the time domain, or the noise power spectral density in the 
frequency domain. 

Thermal noise is caused by the thermal agitation of the electrons and occurs
in almost all devices. Thermal noise is modeled by a parallel current source, the
value of which is a zero mean stochastic process with a frequency-independent
(white-noise) spectral density equal to

$$S_{th}(\omega) = 4kTG. \tag{1}$$

Here k is Boltzmann's constant, T the absolute temperature, and G the
conductance.

Shot noise is due to the fact that the current through a PN junction consists
of discrete charge carriers randomly crossing a potential barrier. Shot noise in a
junction is also modeled by a parallel, white-noise current source. The spectral
density of shot noise is

$$S_{sh}(\omega) = 2qI_d. \tag{2}$$

Here q is the electron charge and $I_d$ the average junction current.
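For a sense of scale, the two white-noise densities can be evaluated directly. The sketch below uses the thermal expression 4kTG from (1) and the standard shot-noise expression $2qI_d$; the function names and the example operating point (a 1 kOhm resistor at 300 K, a 1 mA junction current) are illustrative assumptions.

```python
# Physical constants in SI units (exact values in the 2019 SI definition).
K_BOLTZMANN = 1.380649e-23      # J/K
Q_ELECTRON = 1.602176634e-19    # C

def thermal_noise_psd(conductance_s, temperature_k):
    """Thermal noise current spectral density 4kTG, in A^2/Hz."""
    return 4.0 * K_BOLTZMANN * temperature_k * conductance_s

def shot_noise_psd(junction_current_a):
    """Shot noise current spectral density 2 q I_d, in A^2/Hz."""
    return 2.0 * Q_ELECTRON * junction_current_a

s_th = thermal_noise_psd(1e-3, 300.0)   # 1 kOhm resistor (G = 1 mS) at 300 K
s_sh = shot_noise_psd(1e-3)             # 1 mA average junction current
```

Both densities are independent of frequency, which is what makes these sources white.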
Flicker (or 1/f) noise occurs in all active devices and even in some resistors
due to mechanisms that are not well understood. Flicker noise is modeled by a
stochastic process with a frequency-dependent spectral density

$$S_{fl}(\omega) = K_1 \frac{I^a}{\omega^b}. \tag{3}$$

Here I is the average direct current, $K_1$ is a technology-dependent constant
characterizing a particular device and process, a is a constant in the range 0.5 - 2.0,
and b is a constant of about one (hence the name 1/f noise).

In this paper we only consider circuits with a constant excitation in DC
steady-state. The methods presented below can also be applied to circuits with
time-varying bias conditions [7], but such an extension is beyond the scope of
the present paper. Moreover, noise is assumed to represent a "small"
perturbation to the circuit.

The circuit equations under these assumptions are

$$f\left(x(t), \tfrac{d}{dt}x(t), b_0, n(t)\right) = 0. \tag{4}$$

Here $x(t)$ is the vector of circuit variables, typically currents and voltages, $b_0$ is
the constant DC excitation, and $n(t)$ is a vector of "small" perturbations caused
by the noise sources.

Moreover, we assume that the circuit is stable, and therefore the solution, $x_0$,
of the noiseless circuit is constant in time and satisfies

$$f(x_0, 0, b_0, 0) = 0, \tag{5}$$

since, obviously, $\tfrac{d}{dt}x_0 = 0$.

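The response of the circuit in the presence of the perturbation $n(t)$ is the
perturbation, $z(t)$, of the DC solution. The perturbation $z(t)$ satisfies

$$f\left(x_0 + z(t), \tfrac{d}{dt}z(t), b_0, n(t)\right) = 0. \tag{6}$$

Assuming that the noise perturbation is "small", the first-order Taylor expansion
of the circuit equations (6) around the DC solution is sufficiently accurate. Thus

$$f(x_0, 0, b_0, 0) + Gz(t) + C\tfrac{d}{dt}z(t) - Bn(t) = 0 \tag{7}$$

where

$$G = \left.\frac{\partial f}{\partial x}\right|_{x_0, 0, b_0, 0}, \qquad C = \left.\frac{\partial f}{\partial \dot{x}}\right|_{x_0, 0, b_0, 0}, \qquad \text{and} \qquad B = -\left.\frac{\partial f}{\partial n}\right|_{x_0, 0, b_0, 0}.$$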
Subtracting (5) from (7), we are left with just the system

$$Gz(t) + C\tfrac{d}{dt}z(t) = Bn(t) \tag{8}$$

of linear differential equations for the perturbation signals.

The vector stochastic process $n(t)$ is specified in terms of its
frequency-domain cross-spectral density matrix $S_{xx}(\omega)$. The diagonal elements in $S_{xx}(\omega)$
represent the power spectral density of each noise source, and the off-diagonal
elements describe statistical coupling of noise signals. In practical cases, $S_{xx}$
will almost always be a diagonal matrix.

The noise analysis problem reduces to that of the propagation of a stochastic
process through a linear system. The general expression of the noise power
spectral density, $S_{yy}$, at the output of the linear system is given by the
well-known formula [5]

$$S_{yy}(j\omega) = H(j\omega)S_{xx}(j\omega)H^H(j\omega). \tag{9}$$

When only one output is analyzed, $S_{yy}(j\omega)$ is just a scalar function of frequency.
For more than one output, $S_{yy}(j\omega)$ is a full matrix, the dimension of which is the
number of outputs. The diagonal elements of $S_{yy}(j\omega)$ represent the power
spectral density of the noise at each output and the off-diagonal elements represent
the cross-spectral density.

The many-to-one vector transfer function of the linear system from the noise
sources to an output port of interest is

$$H(j\omega) = l^T(G + j\omega C)^{-1}B, \tag{10}$$

where l denotes the incidence vector that corresponds to the output port of
interest. More generally, when more than one output is considered, we have a
many-to-many matrix-transfer-function from noise sources to the outputs,

$$H(j\omega) = L^T(G + j\omega C)^{-1}B, \tag{11}$$

where L is the incidence matrix of the output ports.

From formula (9), using (10), we obtain the following expression for the noise
power spectral density at the output of the system:

$$S_{yy}(j\omega) = l^T(G + j\omega C)^{-1}BS_{xx}(j\omega)B^T(G + j\omega C)^{-H}l. \tag{12}$$

The noise analysis method implemented in SPICE [6] evaluates this expression
efficiently, for a given $\omega$, using the solution

$$x_a(j\omega) = (G + j\omega C)^{-H}l$$

of the adjoint system. In terms of $x_a(j\omega)$, (12) reduces to

$$S_{yy}(j\omega) = x_a^H(j\omega)BS_{xx}(j\omega)B^Tx_a(j\omega). \tag{13}$$
The SPICE noise-computation algorithm computes the noise power only at a
given frequency. When we are interested in the spectrum of the noise, we must
repeat the procedure for a large number of frequencies.
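The per-frequency adjoint evaluation can be sketched as below. This is an illustration with random stand-in matrices and white, uncorrelated sources (all hypothetical), not the SPICE implementation; it compares the adjoint form with the direct evaluation of $H S_{xx} H^H$ and then sweeps a frequency grid, which requires one linear solve per frequency point.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 8, 5   # circuit unknowns and noise sources (illustrative sizes)

M = rng.standard_normal((n, n))
G = M @ M.T + n * np.eye(n)
M = rng.standard_normal((n, n))
C = M @ M.T + np.eye(n)
B = rng.standard_normal((n, m))
l = np.zeros(n)
l[0] = 1.0                                   # select the first unknown as output
Sxx = np.diag(rng.uniform(0.5, 2.0, m))      # white, uncorrelated noise sources

def noise_psd_adjoint(omega):
    """Output noise density via one adjoint solve: x_a = (G + jwC)^{-H} l."""
    K = G + 1j * omega * C
    xa = np.linalg.solve(K.conj().T, l)
    return (xa.conj() @ B @ Sxx @ B.T @ xa).real

def noise_psd_direct(omega):
    """Reference: H Sxx H^H with H = l^T (G + jwC)^{-1} B."""
    H = l @ np.linalg.solve(G + 1j * omega * C, B)
    return (H @ Sxx @ H.conj()).real

# Sweeping a grid costs one factorization and solve per frequency point.
omegas = np.logspace(-2, 2, 9)
spectrum = np.array([noise_psd_adjoint(w) for w in omegas])
```

The adjoint form needs a single solve per frequency regardless of the number of noise sources, whereas the direct form solves for every column of B.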

We now introduce a novel noise analysis method that computes a closed-form 
rational expression for the noise power spectral density over a wide frequency 
range. This method is more efficient than the classical method, when noise 
needs to be computed over a frequency grid. Moreover, the closed-form ex- 
pression represents a compact model of the noise spectrum, and can be used 
hierarchically in system-level simulations. 



3. Reformulated noise spectral density 

For the following, it will be convenient to introduce the new variable $s := j\omega$.
Note that, in order to be "physically" meaningful, the variable s has to be purely
imaginary. For now, we thus assume that s is purely imaginary. Later on, we
will drop this constraint and treat s as a general complex variable.

The new noise analysis and modeling algorithm relies on the computation of
a Padé approximation of the noise power spectral density expression (12). The
Padé approximation of a general transfer function expression of the form

$$F(s) = l^T(G + sC)^{-1}r, \tag{14}$$

where r and l are vectors of length N, and G and C are $N \times N$ matrices, can
be computed efficiently with the PVL (Padé via Lanczos) algorithm [1]. At
first glance, it appears that noise-type transfer functions (12) are very different
from (14). However, we will show that there are vectors r, l and matrices G, C
so that the functions (12) and (14) agree for all purely imaginary s, i.e., for all
physically meaningful values of s.

First, consider the case when the noise sources are all white. Then $S_{xx}$ is not
a function of the frequency, and thus the function (12) reduces to

$$F(s) = l^T(G + sC)^{-1}BS_{xx}B^T(G + sC)^{-H}l. \tag{15}$$

Here we have used the new variable $s = j\omega$. We rewrite (15) by introducing two
new vectors, u and v, as follows:

$$F(s) = l^Tu, \qquad v = (G + sC)^{-H}l, \qquad u = (G + sC)^{-1}BS_{xx}B^T(G + sC)^{-H}l = (G + sC)^{-1}BS_{xx}B^Tv. \tag{16}$$

The vectors u and v represent, therefore, the solution of a system of linear
equations:

$$\begin{bmatrix} 0 & (G + sC)^H \\ G + sC & -BS_{xx}B^T \end{bmatrix}\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} l \\ 0 \end{bmatrix} \tag{17}$$
From (16) and (17) we obtain the expression

$$F(s) = \begin{bmatrix} l^T & 0 \end{bmatrix}\left(\begin{bmatrix} 0 & G^T \\ G & -BS_{xx}B^T \end{bmatrix} + s\begin{bmatrix} 0 & -C^T \\ C & 0 \end{bmatrix}\right)^{-1}\begin{bmatrix} l \\ 0 \end{bmatrix} \tag{18}$$

Note that (18) is exactly of the form (14), and thus amenable to PVL reduction.
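The equivalence between the block form (18) and the white-noise expression (15) on the imaginary axis can be checked numerically. The sketch below uses random real stand-in matrices (hypothetical data); the names `G_big`, `C_big` and `l_big` for the augmented quantities are ours.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 6, 4   # circuit unknowns and noise sources (illustrative sizes)

M = rng.standard_normal((n, n))
G = M @ M.T + n * np.eye(n)
M = rng.standard_normal((n, n))
C = M @ M.T + np.eye(n)
B = rng.standard_normal((n, m))
Sxx = np.diag(rng.uniform(0.5, 2.0, m))
l = rng.standard_normal(n)

# Augmented matrices and vector appearing in (18).
Z = np.zeros((n, n))
G_big = np.block([[Z, G.T], [G, -B @ Sxx @ B.T]])
C_big = np.block([[Z, -C.T], [C, Z]])
l_big = np.concatenate([l, np.zeros(n)])

def f_block(s):
    """F(s) from (18): a transfer function of the form (14) in 2n unknowns."""
    return l_big @ np.linalg.solve(G_big + s * C_big, l_big)

def f_direct(s):
    """F(s) from (15)-(16): solve for v, then u, then take l^T u."""
    K = G + s * C
    v = np.linalg.solve(K.conj().T, l)
    u = np.linalg.solve(K, B @ Sxx @ B.T @ v)
    return l @ u
```

The two functions agree for purely imaginary s because, for real G and C, $(G + sC)^H = G^T - sC^T$ holds only on the imaginary axis; elsewhere (18) defines the continuation of F(s) as a function of s.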

Unfortunately, as shown in the previous section, not all noise sources are 
white. In order to be able to treat more general noise sources, we actually con- 
sider a more general class of noise-type transfer functions. More precisely, we 
study functions of the form 



$$F(s) = l^T (G + sC)^{-1} B P^{-1}(s) B^T (G + sC)^{-H} l. \tag{19}$$

Here, l is a real vector of length N, G and C are real N x N matrices, B is a real
N x M matrix, and P(s) is a matrix polynomial,

$$P(s) = P_0 + P_1 s + P_2 s^2 + \cdots + P_L s^L, \tag{20}$$

whose coefficients P_i, i = 0, 1, ..., L, are M x M matrices. We assume that P_L
is not a zero matrix, so that L is the degree of the matrix polynomial P(s). The
form (19) can express practically all interesting noise power spectral densities.
The degree L itself can be arbitrary; however, the cases of low degree such as
L = 0, 1, 2 are the most important ones. For example, for L = 0 and
P_0 = S_xx^{-1}, the function (19) reduces to the case (15) of white noise. The
flicker noise frequency-dependent power spectral density (3) can also be well
approximated by an expression of form (19) by expanding the denominator into a
power series as follows:

5fl(to) =^ri/“ (co-f-cis-|-C2S^ + ...) *. (21) 

Rewriting the noise-type transfer function F(s) given by (19) in the form (14)
then allows us to compute Padé-based reduced-order models for F(s) by simply
applying the PVL algorithm to the representation (14) of F(s).

Next, we show how to transform (19) to form (14). Consider the linear system 



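$$\begin{bmatrix}
0 & (G + sC)^H & 0 & 0 & \cdots & 0 \\
G + sC & 0 & B & 0 & \cdots & 0 \\
0 & B^T & P_0 + sP_1 & sP_2 & \cdots & sP_L \\
0 & 0 & sI & -I & & \\
\vdots & \vdots & & \ddots & \ddots & \\
0 & 0 & & & sI & -I
\end{bmatrix}
\begin{bmatrix} x \\ y \\ z_1 \\ z_2 \\ \vdots \\ z_L \end{bmatrix}
=
\begin{bmatrix} l \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \tag{22}$$
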

From the last L - 1 blocks of equations in (22), it follows that 



$$z_i = s z_{i-1} \quad \text{for all } i = 2, 3, \ldots, L, \tag{23}$$

and thus

$$z_i = s^{i-1} z_1 \quad \text{for all } i = 2, 3, \ldots, L. \tag{24}$$

Using the third block of equations in (22), together with (24) and (20), we get 

$$\begin{aligned}
B^T y &= -\bigl( (P_0 + sP_1) z_1 + sP_2 z_2 + \cdots + sP_L z_L \bigr) \\
      &= -(P_0 + sP_1 + s^2 P_2 + \cdots + s^L P_L) z_1 \\
      &= -P(s) z_1.
\end{aligned} \tag{25}$$



By the first two blocks of equations in (22), we have 

$$\begin{aligned}
y &= (G + sC)^{-H} l, \\
x &= -(G + sC)^{-1} B z_1.
\end{aligned} \tag{26}$$



Combining (25) and (26), we get 

$$x = (G + sC)^{-1} B (P(s))^{-1} B^T (G + sC)^{-H} l. \tag{27}$$



Next, we observe that, for purely imaginary s, the linear system (22) can be 
rewritten in the form 



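$$(\tilde G + s \tilde C)\, \tilde x = \tilde l. \tag{28}$$

Here $\tilde x$ and $\tilde l$ are vectors of length $\tilde N := 2N + LM$ defined by

$$\tilde x = \begin{bmatrix} x \\ y \\ z_1 \\ \vdots \\ z_L \end{bmatrix}
\quad \text{and} \quad
\tilde l = \begin{bmatrix} l \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \tag{29}$$

and $\tilde G$ and $\tilde C$ are $\tilde N \times \tilde N$ matrices given by

$$\tilde G = \begin{bmatrix}
0 & G^T & 0 & 0 & \cdots & 0 \\
G & 0 & B & 0 & \cdots & 0 \\
0 & B^T & P_0 & 0 & \cdots & 0 \\
0 & 0 & 0 & -I & & \\
\vdots & \vdots & \vdots & & \ddots & \\
0 & 0 & 0 & & & -I
\end{bmatrix} \tag{30}$$

and

$$\tilde C = \begin{bmatrix}
0 & -C^T & 0 & 0 & \cdots & 0 \\
C & 0 & 0 & 0 & \cdots & 0 \\
0 & 0 & P_1 & P_2 & \cdots & P_L \\
0 & 0 & I & 0 & & \\
\vdots & \vdots & & \ddots & \ddots & \\
0 & 0 & & & I & 0
\end{bmatrix}. \tag{31}$$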




Using (19) and (28)-(31), it follows that 









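$$\begin{aligned}
F(s) &= l^T (G + sC)^{-1} B (P(s))^{-1} B^T (G + sC)^{-H} l \\
     &= l^T x = \tilde l^T \tilde x = \tilde l^T (\tilde G + s \tilde C)^{-1} \tilde l.
\end{aligned} \tag{32}$$

This shows that, for purely imaginary s, the noise-type transfer function (19) is
indeed of the form (14) with $r = l = \tilde l$, $G = \tilde G$, and $C = \tilde C$ defined in (29)-(31).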
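
The construction (29)-(31) can be sketched in code for a general degree L >= 1. This is an illustration under assumptions, not the authors' implementation: the helper name `build_pvl_form`, the random test matrices, and the diagonal coefficients in `P` are all hypothetical.

```python
import numpy as np

def build_pvl_form(G, C, B, P, l):
    """Assemble l~, G~, C~ of (29)-(31) for P(s) = P[0] + P[1] s + ... + P[L] s^L, L >= 1."""
    N, M = B.shape
    L = len(P) - 1
    n = 2 * N + L * M
    Gt = np.zeros((n, n))
    Ct = np.zeros((n, n))
    z0 = 2 * N                                      # column offset of the block z_1
    Gt[:N, N:2 * N] = G.T                           # first block row: G^T - s C^T
    Ct[:N, N:2 * N] = -C.T
    Gt[N:2 * N, :N] = G                             # second block row: G + s C, B
    Ct[N:2 * N, :N] = C
    Gt[N:2 * N, z0:z0 + M] = B
    Gt[z0:z0 + M, N:2 * N] = B.T                    # third block row: B^T, P_0 + s P_1, s P_2, ...
    Gt[z0:z0 + M, z0:z0 + M] = P[0]
    for i in range(1, L + 1):
        Ct[z0:z0 + M, z0 + (i - 1) * M:z0 + i * M] = P[i]
    for i in range(2, L + 1):                       # remaining rows: s z_{i-1} - z_i = 0
        rows = slice(z0 + (i - 1) * M, z0 + i * M)
        Ct[rows, z0 + (i - 2) * M:z0 + (i - 1) * M] = np.eye(M)
        Gt[rows, rows] = -np.eye(M)
    lt = np.concatenate([l, np.zeros(n - N)])
    return lt, Gt, Ct

rng = np.random.default_rng(1)
N, M = 4, 2                                         # illustrative sizes
G = rng.standard_normal((N, N)) + 5 * np.eye(N)
C = rng.standard_normal((N, N))
B = rng.standard_normal((N, M))
l = rng.standard_normal(N)
P = [2.0 * np.eye(M), 0.5 * np.eye(M), 0.1 * np.eye(M)]   # degree L = 2

def F_direct(s):
    """Evaluate (19) directly."""
    K = G + s * C
    Ps = sum(Pi * s**i for i, Pi in enumerate(P))
    y = np.linalg.solve(K.conj().T, l)
    return l @ np.linalg.solve(K, B @ np.linalg.solve(Ps, B.T @ y))

lt, Gt, Ct = build_pvl_form(G, C, B, P, l)
s = 0.7j
F_lin = lt @ np.linalg.solve(Gt + s * Ct, lt)
print(abs(F_direct(s) - F_lin))                     # difference at rounding-error level
```

The enlarged matrices have dimension 2N + LM, so a low degree L keeps the overhead of the reformulation small.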

Of particular interest are several special cases: 

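• The case L = 1.

  Here $\tilde l$, $\tilde G$, and $\tilde C$ reduce to

  $$\tilde l = \begin{bmatrix} l \\ 0 \\ 0 \end{bmatrix}, \quad
  \tilde G = \begin{bmatrix} 0 & G^T & 0 \\ G & 0 & B \\ 0 & B^T & P_0 \end{bmatrix}, \quad
  \tilde C = \begin{bmatrix} 0 & -C^T & 0 \\ C & 0 & 0 \\ 0 & 0 & P_1 \end{bmatrix}. \tag{33}$$

• The case L = 0.

  This is the case (15) that all noise sources are white. It is covered by (33)
  with $P_0 = S_{xx}^{-1}$ and $P_1 = 0$. However, in this case, we eliminate the third
  block rows and columns in (33) and obtain

  $$\tilde l = \begin{bmatrix} l \\ 0 \end{bmatrix}, \quad
  \tilde G = \begin{bmatrix} 0 & G^T \\ G & -B S_{xx} B^T \end{bmatrix}, \quad
  \tilde C = \begin{bmatrix} 0 & -C^T \\ C & 0 \end{bmatrix}. \tag{34}$$

  This is exactly the form arrived at in (18).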



4. Application of PVL 

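Now that we have shown how to reformulate noise-type transfer functions
F(s) given by (19) in the “PVL” form (14), it is straightforward to employ PVL
to generate reduced-order models. Recall that, in our case, r = l in (14).

First, we choose a real expansion point $s_0$ and compute $\mathcal{L}$ and $\mathcal{U}$ via the
factorization

$$G + s_0 C = \mathcal{L} \cdot \mathcal{U}. \tag{35}$$

With $s = s_0 + \sigma$, the function F can then be written as

$$\begin{aligned}
F(s_0 + \sigma) &= l^T (G + s_0 C + \sigma C)^{-1} l \\
                &= (\mathcal{U}^{-T} l)^T (I + \sigma \mathcal{L}^{-1} C \mathcal{U}^{-1})^{-1} \mathcal{L}^{-1} l.
\end{aligned} \tag{36}$$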

Next, we apply the Lanczos process to the matrix $A = \mathcal{L}^{-1} C \mathcal{U}^{-1}$, using
$b = \mathcal{L}^{-1} l$ and $c = \mathcal{U}^{-T} l$ as the right, respectively left, starting vector.
After running the Lanczos process for n iterations, we obtain an n x n tridiagonal
matrix, $T_n$, such that the function

$$F_n(s_0 + \sigma) = (c^T b) \cdot e_1^T (I + \sigma T_n)^{-1} e_1, \tag{37}$$

where $e_1$ represents the first unit vector of length n, is just an n-th Padé
approximant to $F(s_0 + \sigma)$. More precisely, $F_n(s_0 + \sigma)$ is a rational function of
$\sigma$ with numerator polynomial of degree at most n - 1 and denominator polynomial
of degree at most n such that

$$F_n(s_0 + \sigma) = F(s_0 + \sigma) + O(\sigma^{q(n)}), \tag{38}$$

where q(n) is maximal. In the generic case, q(n) = 2n. Note that (38) just states
that the Taylor expansions of $F_n$ and F about the expansion point $s_0$ agree in as
many leading Taylor coefficients as possible. We note that all quantities involved
in the Lanczos process are real, as long as the coefficient matrices $P_0, P_1, \ldots, P_L$
of (20) are real, which is usually the case.
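
The steps above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the PVL implementation of [1]: it uses a plain two-sided Lanczos process with full re-biorthogonalisation and no look-ahead, and it replaces the LU factorization (35) by the trivial split in which the first factor is G + s_0 C and the second is the identity, so that A = (G + s_0 C)^{-1} C, b = (G + s_0 C)^{-1} l, and c = l. The function names and the random test data are hypothetical.

```python
import numpy as np

def two_sided_lanczos(A, b, c, n):
    """Return T_n = W^T A V for biorthogonal Krylov bases V, W with W^T V = I.

    No look-ahead: the routine stops with an error on (near-)breakdown.
    """
    N = A.shape[0]
    V = np.zeros((N, n))
    W = np.zeros((N, n))
    v, w = b.astype(float), c.astype(float)
    for k in range(n):
        for _ in range(2):                          # second pass for numerical safety
            v = v - V[:, :k] @ (W[:, :k].T @ v)
            w = w - W[:, :k] @ (V[:, :k].T @ w)
        d = w @ v
        if abs(d) <= 1e-12 * np.linalg.norm(v) * np.linalg.norm(w):
            raise RuntimeError("Lanczos breakdown; look-ahead would be needed")
        scale = np.sqrt(abs(d))
        V[:, k] = v / scale
        W[:, k] = w * (scale / d)                   # enforces W[:, k] . V[:, k] = 1
        v = A @ V[:, k]
        w = A.T @ W[:, k]
    return W.T @ A @ V                              # tridiagonal up to rounding

def pade_model(G, C, l, s0, n):
    """Reduced-order model of F(s) = l^T (G + sC)^{-1} l in the form (37)."""
    K = G + s0 * C
    A = np.linalg.solve(K, C)                       # assumption: trivial split instead of LU
    b = np.linalg.solve(K, l)
    c = l.astype(float)
    T = two_sided_lanczos(A, b, c, n)
    e1 = np.zeros(n)
    e1[0] = 1.0
    def Fn(s):
        sigma = s - s0
        return (c @ b) * (e1 @ np.linalg.solve(np.eye(n) + sigma * T, e1))
    return Fn, T

rng = np.random.default_rng(2)
N, n, s0 = 10, 4, 1.0                               # illustrative sizes and expansion point
G = rng.standard_normal((N, N)) + 10 * np.eye(N)
C = rng.standard_normal((N, N))
l = rng.standard_normal(N)
Fn, T = pade_model(G, C, l, s0, n)
s = s0 + 0.2j
F_exact = l @ np.linalg.solve(G + s * C, l)
print(abs(F_exact - Fn(s)))                         # small close to the expansion point
```

In exact arithmetic the reduced function matches the first 2n Taylor coefficients of F about s_0, which is the generic case q(n) = 2n of (38). A production implementation would use the sparse LU factors of (35) and a look-ahead Lanczos variant to avoid breakdowns.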

We observe that the reduced-order model for the noise spectral density of a
circuit module will always have the form in (37), which results from the PVL
algorithm. If the reduced-order models of circuit modules are used in higher-
level simulations, expressions of the form in (37) appear in the $S_{xx}(s)$ noise
source spectral density matrix of the system simulation. The resulting output
noise spectral density of the system will have the form

$$F(s) = l^T (G + sC)^{-1} B (P_0 + sP_1)^{-1} B^T (G + sC)^{-H} l. \tag{39}$$

This form is compatible with PVL, as shown for the special case L = 1 in (33).

Finally, we make some comments regarding properties of the PVL algorithm 
specific to its application to “noise”-type problems. So far, we have made no 
assumptions on the matrix polynomial 

$$P(s) = P_0 + P_1 s + P_2 s^2 + \cdots + P_L s^L. \tag{40}$$

If the function F(s) describes the noise power spectral density of a circuit, then
P(s) needs to be such that

$$F(j\omega) \geq 0 \quad \text{for all } \omega \geq 0. \tag{41}$$

Ideally, the PVL algorithm is stopped as soon as the Padé approximant $F_n$ has
converged to F in the frequency range of interest, i.e., if

$$|F(j\omega) - F_n(j\omega)| < \mathrm{tol} \quad \text{for all } \omega \in [\omega_{\min}, \omega_{\max}]. \tag{42}$$

Together with (41), this ensures that the Padé-based reduced-order model

$$F_n(s_0 + \sigma) = (c^T b) \cdot e_1^T (I + \sigma T_n)^{-1} e_1 \tag{43}$$

satisfies

$$F_n(j\omega) \geq 0 \quad \text{for all } \omega \in [\omega_{\min}, \omega_{\max}]. \tag{44}$$

This observation is important if we want to use the reduced-order model as a
noise source in a high-level simulation.

As a final note, we remark that the discussion in this and previous sections
can be generalized for the computation of the cross-spectral density matrix of
multiple outputs. In the multiple output case the r and l vectors become matrices
and MPVL [2] is used instead of PVL.

5. Examples 

We applied the noise computation algorithm to a number of circuits. The first 
example is the 741 operational amplifier. The size of the problem is 55 variables. 
Figure 1 shows the exact transfer function of the amplifier compared to the PVL 
reduced-order models of orders 16 and 20. The order 20 approximation cap- 
tures the behavior of the amplifier almost exactly. Figure 2 shows the amplifier 
output noise power spectral density over the same frequency range. Here a Padé
approximation of order 5 is already sufficient to capture the noise spectrum.

The next example is a 5-th order Cauer filter that uses ten 741 opamps as 
building blocks. The total size of the problem is 463 variables. Figures 3 and 4 
show the transfer function and the output noise spectrum computed exactly and 
with PVL. We observe that we need roughly the same number of iterations to 
obtain an almost perfect match of both the transfer function and the noise spec- 
trum. 

The final example is a bandpass filter derived from a 3-rd order Chebyshev 
low-pass prototype, and implemented with single amplifier biquads. It also uses 
the 741 opamp as a building block. The problem size is 147. Figures 5 and 6 
show the transfer function and the output noise spectrum computed exactly and 
with PVL. We observe that we need 18 iterations to match the transfer function 
and only 14 to match the noise spectrum. 

6. Conclusions 

In this paper we have introduced a new noise analysis method that computes
the noise power spectral density of a circuit node or the cross-spectral density of
a number of nodes. The results are presented in the form of a closed-form
rational function of frequency, which represents a Padé approximation
of the true noise spectral density. This method is significantly more efficient
than the classical noise analysis method for predicting noise over a range of
frequencies. The main advantage of the method, however, is the fact that it
produces a reduced-order model of the noise generated by the circuit under
analysis. This model can then be employed, using the same algorithm, in a
system-level analysis. The noise analysis algorithm accepts the noise source
power spectral density in a rational polynomial form. This form covers
practically all possible noise sources of interest.

Figure 1. 741 gain (741 transfer function).

Figure 2. 741 noise (741 output noise versus frequency in Hz).

Figure 3. Cauer filter transfer characteristic (versus frequency in Hz).

Figure 4. Cauer filter noise (versus frequency in Hz).

Figure 5. Bandpass filter transfer characteristic.

Figure 6. Bandpass filter noise.

1. Introduction

Since the early 1970s, physical design has been a crucial part of chip design. Its
origin came from the need for obtaining optimum electronic packaging, which
involves the placement of components and modules on boards and interconnecting
the pins on modules. Most early approaches were by and large empirical
and ad hoc. One major exception is perhaps Lee’s maze router [8], which has
proven to be a powerful computational tool for routing. In the late 1970s, more
analytical tools for layout began to evolve both from industry and the universities.
The field of physical design gradually attracted sizeable research interest.

The adoption of automated physical design tools in the semiconductor industry
has accelerated since the mid-1980s. This is mainly driven by the number of
transistors available on a single chip. As we write this paper in the year
2002, process technology is at 90 nanometers in critical process dimension. This
is a 2500-fold improvement over 1985. In terms of layout objects, the increase
is from the low thousands to the high millions for a real-world chip. As a result,
physical design today is totally dependent on automated tools. The quality of
physical design tools determines the competitiveness and cost of the final
product. The problem sizes are so large that no one can tell whether a solution
generated by the software is good or bad until a better solution is presented. The
fact that the physical design problem is NP-hard makes research on physical
design algorithms extremely interesting. This can be seen in the investment
made by the electronics industry. During this time period, the electronic design
automation (EDA) industry has grown from $50M to over $4B in size.

The ICCAD conferences, which started in 1983, have published many outstanding
papers in the field. The present chapter includes six such papers, which span
some important sub-areas: partitioning, placement, floorplanning, and clock
skew. It should be pointed out that they by no means cover all areas of interest
in physical design. As an example, neither global routing nor detailed routing
is included. Also, some crucial issues of current interest in deep sub-micron
design, such as timing-driven layout, the important consideration of interconnect,
and the integration between physical design and logic synthesis, are completely
left out.

There exist previous review papers and edited books, for example, those con- 
tributed by Breuer [9] in 1979, Soukup [10] in 1981, Hu and Kuh [11] in 1985, 
Ohtsuki [12] in 1986, Kuh and Ohtsuki [13] in 1990, Camposano and Pedram [7] 
in 2000, and more recently excellent textbooks by Lengauer [14], Sherwani [15], 
and Sarrafzadeh and Wong [16], etc. 

Our paper will first make brief comments on each of the six papers included in 
the Chapter. Furthermore, it will mention some of the related later work because 
much work has appeared since 1995 (the latest among the selected papers), espe- 
cially with respect to the interest in deep sub-micron technology. Some practical 
aspects of various problems will also be mentioned. Finally, we will treat briefly 
an important subject on the integration of physical design and logic synthesis. 

2. Topics on selected papers 

2.1 Partitioning (with comments on [1]) 

Partitioning is an important part of chip and circuit design. It is crucial in
the key problems in physical design, for example, in placement and floorplanning,
to be discussed in Sec. 2.2 and Sec. 2.3. The input of the problem is
usually defined as a graph with each vertex representing a device, a small cell, or
a larger module in a circuit or chip. The edges of the graph give the interconnect
information, i.e., the number of connecting lines between two cells. The problem
is to partition the graph into two or more parts so that the total edge cut
is minimum. The partitioned parts are called partitioning subgraphs. When the
partition is two-way, we call it a bipartition. Often it is required that the two
partitioned subgraphs are balanced, i.e., each has about the same size. In [17], a
ratio cut has been proposed to treat the partition sizes as variables and balance
the partition with a rational function. Partitioning can also be discussed with
respect to hypergraphs, i.e., vertices are connected by multi-terminal nets.
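
As a small illustration of these objectives, the sketch below computes the edge cut of a bipartition and the ratio-cut value, taken here as the cut weight divided by the product of the two part sizes. The graph, the weights, and the function names are made-up examples, not data from [17].

```python
def cut_size(edges, part_a):
    """Total weight of the edges with exactly one endpoint in part_a."""
    a = set(part_a)
    return sum(w for u, v, w in edges if (u in a) != (v in a))

def ratio_cut(edges, part_a, vertices):
    """Cut weight divided by |A| * |B|; smaller is better."""
    a = set(part_a)
    b = set(vertices) - a
    return cut_size(edges, a) / (len(a) * len(b))

# Hypothetical 4-vertex graph given as (u, v, weight) triples.
vertices = [0, 1, 2, 3]
edges = [(0, 1, 2), (1, 2, 1), (2, 3, 3), (3, 0, 1), (1, 3, 1)]

print(cut_size(edges, {0, 1}), ratio_cut(edges, {0, 1}, vertices))   # balanced split
print(cut_size(edges, {0}), ratio_cut(edges, {0}, vertices))         # unbalanced split
```

Both bipartitions cut the same total weight in this example, but the ratio-cut value favors the balanced one, which is the effect of the rational objective function.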

Because partitioning has broad applications in many fields, there exist
well-known techniques and algorithms, for example, that of Kernighan and Lin [18]
together with its efficient implementation by Fiduccia and Mattheyses [19], the
spectral method [20], and the simulated annealing method. Paper [1] presents a
method using network flow. The essence of the proposed method is as follows:
From the given graph, a much larger associated graph called the flow network is
introduced. In the flow network there is a source s and a sink t. We consider the
flow from s to t. The well-known max-flow min-cut algorithm used in operations
research is applied to the partitioning problem, which leads to an optimum
partitioning with respect to s and t. This in turn gives a solution to the original
bipartitioning problem with minimum cut; however, the result is usually not
balanced, i.e., the bipartition leads to two subgraphs of vastly different size.
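
The max-flow min-cut step can be illustrated on an ordinary weighted graph. The sketch below is a generic Edmonds-Karp implementation under simplifying assumptions; it does not build the larger flow network that paper [1] derives from the nets of a circuit, and the function name and the example graph are hypothetical.

```python
from collections import deque

def min_st_cut(n, edges, s, t):
    """Edmonds-Karp max flow on an undirected weighted graph.

    Returns (cut value, set of vertices on the source side of a minimum cut).
    """
    cap = [[0] * n for _ in range(n)]
    for u, v, w in edges:
        cap[u][v] += w
        cap[v][u] += w                      # an undirected edge acts as two opposite arcs
    flow = 0
    while True:
        parent = [-1] * n                   # breadth-first search in the residual graph
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if cap[u][v] > 0 and parent[v] == -1:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            break                           # no augmenting path left: flow is maximum
        f = float("inf")
        v = t
        while v != s:                       # bottleneck capacity along the path
            f = min(f, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:                       # push f units along the path
            u = parent[v]
            cap[u][v] -= f
            cap[v][u] += f
            v = u
        flow += f
    source_side = {v for v in range(n) if parent[v] != -1}
    return flow, source_side

# Two tightly connected triangles joined by a single light edge (2, 3).
edges = [(0, 1, 3), (1, 2, 3), (0, 2, 3), (3, 4, 3), (4, 5, 3), (3, 5, 3), (2, 3, 1)]
value, side = min_st_cut(6, edges, s=0, t=5)
print(value, sorted(side))
```

The vertices still reachable from s in the final residual graph form one side of a minimum cut. This example happens to give a balanced result; in general the minimum s-t cut may isolate only a few vertices, which is the imbalance that the iterative collapsing procedure of [1] addresses.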

The next step of the method is to introduce a balanced bipartition in the flow
network. This calls for collapsing one of the subgraphs to either s or t and
repeating the min-net-cut. The process continues until the recombined parts
reach a stage so that the two subgraphs of the flow network are balanced. An
efficient implementation of the iterative process is proposed in the paper.
Computational complexity and illustrative examples are given. It is demonstrated
that, using the standard MCNC benchmark examples, the method outperforms both
the Kernighan-Lin method and the spectral method.

The partitioning algorithms developed so far have been used widely in
real-world problems. In most commercial placement software, partitioning is one
of the fundamental structures used to find a global solution. By its nature, it is
a hierarchical divide-and-conquer technique. It suffers from the lack of details
inside each partition. As a result, partitioning is always followed by other local
optimization techniques to improve its results. It also uses a simple
representation for the problem, i.e., minimizing the signals crossing the partition
boundaries. While this is an easy objective function, it is more complicated when
one has to consider timing of the circuits, and, in addition, it is much more
complicated when the distributed interconnect topology impacts timing as well.
Recently, there has been some interesting work on multilevel techniques [21, 22].
The field of partitioning will need more advancement to address these new issues.

2.2 Placement (with comments on [2]) 

When the placement problem was first introduced in physical design, it was
formulated as placing cells of point dimension for a given connection
specification in terms of an incidence matrix or connection matrix. The m point
cells are placed on a rectangular array of evenly spaced locations. This is often
referred to as the quadratic placement problem. In the early days of design
automation, quadratic placement was needed for both the standard cell and gate
array designs. Various methods were used; however, it was not until the program
Gordian [2] was introduced that the industry began to realize the virtue of a good
placement method. The key to Gordian’s success is the combination of
bipartitioning and a global optimum placement, used top down until all modules
are placed. Earlier approaches using the same strategy include the work on min-cut
partitioning and resistance network solution [23], and the Proud program [24].
Proud differs from Gordian in that each placement optimization imposes a boundary
constraint instead of a centroid constraint, i.e., the center of the region is used
to guide the optimization process. The implementation of the algorithm is
superior and the quality of the result is excellent. Both Proud and Gordian have
been used in commercial products. In addition, Gordian has the capability of
handling modules of finite area by means of an exhaustive slicing optimization.
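
The global optimization step of such placers can be sketched in one dimension: minimizing a weighted sum of squared distances between connected cells, with some connections going to fixed pads, reduces to solving a linear system. This is only an illustration of the quadratic objective under assumptions; it omits the centroid constraints and the recursive bipartitioning that Gordian adds, and the function name and data layout are hypothetical.

```python
import numpy as np

def quadratic_placement_1d(n_cells, cell_nets, pad_nets):
    """Minimise sum w (x_i - x_j)^2 + sum w (x_i - p)^2 over the movable coordinates x.

    cell_nets: (i, j, w) connections between movable cells i and j.
    pad_nets:  (i, p, w) connections between movable cell i and a fixed pad at position p.
    """
    Q = np.zeros((n_cells, n_cells))
    d = np.zeros(n_cells)
    for i, j, w in cell_nets:
        Q[i, i] += w
        Q[j, j] += w
        Q[i, j] -= w
        Q[j, i] -= w
    for i, p, w in pad_nets:
        Q[i, i] += w
        d[i] += w * p
    return np.linalg.solve(Q, d)        # stationarity condition Q x = d

# Chain: pad at 0.0 - cell 0 - cell 1 - cell 2 - pad at 4.0, all weights 1.
x = quadratic_placement_1d(3, [(0, 1, 1.0), (1, 2, 1.0)], [(0, 0.0, 1.0), (2, 4.0, 1.0)])
print(x)
```

For this chain the cells spread evenly between the two pads. In a two-dimensional placement the same system is solved independently for the x and y coordinates, and without further constraints the cells tend to cluster, which is what the partitioning steps are meant to resolve.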

Later it became obvious that in physical design some consideration must be
given to timing as well. Several contributions appeared [25, 26, 27]; among
them, Ritual [27] combines a continuous and a discrete part. It introduces
timing constraints on total path delays in global placement, followed by
discrete space optimization such that one cell remains in a region. The former
is accomplished by the method of Lagrangian relaxation, and the latter by
hierarchical partitioning.

Another problem that has emerged but is still unsolved is the consideration
of the effects of interconnect delay on placement. In deep sub-micron design,
circuit delay is dominated by interconnect delay. Interconnect wires also occupy
a sizeable part of the chip area. Thus, it is important to consider interconnect
topology in addition to the gate placement. An attempt has been made in this
direction [28] and may prove to be fruitful in the future.

Among all the papers, the Proud and Gordian papers have made the most
significant impact on the EDA industry. Today essentially all commercial placement
tools are based on this technique. It represented a major breakthrough in
standard cell placement technology in the last 15 years. More recent contributions
include [29, 30, 31, 32]. As the complexity of chips permits the integration of
over 50 million transistors on a single chip, the so-called System-on-a-Chip
(SoC) has become a reality. Along with the shrinking of process geometries come
the added layers for interconnect. Today 8 to 12 layers of metal or copper on the
chip for interconnect are very common. The placement problem becomes one
that mixes many large pre-existing blocks with different shapes plus millions
of small standard cells. Furthermore, some of the blocks allow signals to route
over them and some do not. This calls for new approaches to the placement
problem.



2.3 Floorplanning (with comments on [3, 4]) 

Floorplanning can be considered as a generalization of the point (cell) placement
discussed in the last section, in which point cells become modules of finite
area and given aspect ratio. In the early days, it was also referred to as
building-block placement [33]. Given m rectangular modules of fixed height and
width, a floorplan is a non-overlapping placement of the m modules into a
rectangular region. Like point cell placement, an interconnect specification among
modules is given in terms of an incidence matrix. The objective is to find a
floorplan with minimum total area among all possible floorplans. Some authors also
consider the case that the cost function is a combination of area and the
interconnect. Additional constraints may be imposed, for example, upper and lower
bounds of the aspect ratio. Paper [3] deals with optimization using simulated
annealing. It is well known that simulated annealing is time consuming and, in
general, cannot find the optimum solution. Paper [3] proposes a clever algorithm to
speed up the process. It does not, however, give any example, nor does
it discuss the computational complexity. Otten had done previous good work
on floorplanning such as slicing floorplans, shape functions, and the polynomial
time algorithms [34, 35].

In contrast, paper [4] introduces a brand new theory on floorplanning using
sequence pairs. Given m modules and a floorplan, there exists a sequence
pair which consists of two sequences, one being a permutation of the other.
Given a sequence pair and the shapes of all modules, there is a unique topology
for the floorplan. The topology here refers to the fact that the blocks can be
shifted locally without changing the relative positions between the blocks. Thus,
the solution space of the optimum floorplan is finite, i.e., (m!)^2. Furthermore,
each sequence pair can be mapped to the unique floorplan in O(m^2) time, and at
least one of the (m!)^2 floorplans represents the optimal solution. The proof is by
means of two constraint acyclic digraphs defined by the floorplan. Several large
examples are illustrated. The above can be generalized to include soft modules
(arbitrary aspect ratio) and pre-placed fixed modules.
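
The mapping from a sequence pair to a packing can be sketched with the usual reading of the two sequences: a block a lies to the left of b if a precedes b in both sequences, and a lies below b if a follows b in the first sequence but precedes it in the second. The code below is a minimal O(m^2) sketch under that convention; the function name, the block sizes, and the example sequences are illustrative assumptions, not taken from [4].

```python
def pack_sequence_pair(gamma_plus, gamma_minus, sizes):
    """Map a sequence pair to lower-left block coordinates in O(m^2) time.

    sizes[b] = (width, height) of block b.
    """
    pos_plus = {b: i for i, b in enumerate(gamma_plus)}
    x, y = {}, {}
    placed = []
    for b in gamma_minus:                   # every constraint on b comes from earlier blocks
        x[b] = max((x[a] + sizes[a][0] for a in placed if pos_plus[a] < pos_plus[b]),
                   default=0)               # blocks to the left of b
        y[b] = max((y[a] + sizes[a][1] for a in placed if pos_plus[a] > pos_plus[b]),
                   default=0)               # blocks below b
        placed.append(b)
    width = max(x[b] + sizes[b][0] for b in gamma_minus)
    height = max(y[b] + sizes[b][1] for b in gamma_minus)
    return x, y, width, height

sizes = {0: (4, 2), 1: (2, 3), 2: (3, 3)}   # hypothetical blocks
x, y, width, height = pack_sequence_pair([1, 0, 2], [0, 1, 2], sizes)
print(x, y, width, height)
```

In this example block 0 ends up below block 1 and both lie to the left of block 2. A simulated-annealing floorplanner would perturb the two sequences (for instance by swapping two blocks) and re-evaluate the bounding-box area with this routine.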

The publication of the sequence pair representation inspired many new
developments in floorplanning. In [36], Nakatake et al. introduced a bounded
slicing grid (BSG), an n-by-n checkerboard grid, for a list of n blocks. More
recently, Guo et al. [37] proposed an ordered tree structure (O-tree) for a packed
floorplan (all blocks are packed toward one corner of the floorplan). The number
of O-trees is smaller than the number of slicing trees and yet it is guaranteed to
include all optimal packings of rectangles. This means floorplanners based on
O-trees have the speed of slicing-tree based floorplanners and yet produce higher
quality floorplans compared to those by sequence pair and BSG. In [38, 39], a
Corner Block List together with an equivalent representation for a mosaic
floorplan was proposed that covers all slicing and non-slicing floorplans with no
empty room. In [40], Yao et al. demonstrated that a mosaic floorplan has a
one-to-one correspondence with a twin binary tree.

In chip design, one starts with chip planning based on various specifications 
such as speed, power, etc. Often IP blocks are used as major components. These 
and other modules and components represent a floorplanning problem. The ad- 
vance in floorplanning in physical design discussed above is of great value to 
the process. 

Today, commercial floorplanning tools for real-world designs have to consider
three additional elements: layout hierarchy planning, power distribution,
and clock distribution. First, due to the high integration in a SoC design, more
often than not, one has to do some form of hierarchical physical design, even
though most designers avoid hierarchical design due to its inefficiency in die
size and high complexity in generating and maintaining the hierarchical boundary
constraints. Although the hierarchical design problem has existed for many
years, there has been very little research advancement in this area.

The second real-world concern is to minimize the total power consumption.
Among several techniques, clock gating, which turns off portions of a design
during idle periods, and applying multiple voltage levels for different parts of
the chip are common practices. Because a large electrical current is required, the
area used for power distribution on a chip is also large. The quality of power
distribution is very critical for the reliable operation of the circuit. The large
area required for power distribution, the different voltages and different power
networks for different functions, and the need for a quality power source at every
circuit location make the power distribution task extremely tedious and
challenging.

The third element of real-world floorplan design has to do with the clock
distribution network. The clock distribution network consumes almost half of the
chip power. A good clock distribution scheme, either the zero-skew technique
discussed in Sec. 2.4 or mesh distribution with pre-designed clock buffer rows,
becomes another critical floorplan design task. In order to lower the risk of
power supply noise created by these large clock distribution networks on a chip,
many designs require a separate power distribution network for those clock
circuits.

As a result of these real-world considerations of design hierarchy, power
distribution, and clock distribution, creating a good floorplanning tool has been a
continuing challenge for the electronic design automation industry. All available
commercial floorplanning tools today offer highly interactive environments with
functions for power and clock planning, timing budgeting capability for
hierarchical design, block placement, soft block pin assignment, global routing and
congestion analysis, and voltage drop analysis. This expanded definition of
floorplanning covers a wide variety of issues and is hard to abstract into a
simple problem definition for the researcher. In reality, before one sees a better
floorplan, one does not know it exists. One can easily see the great impact of a
good floorplan versus a bad floorplan, and this is an area worthy of significant
development in the future.

2.4 Zero Skew (with comments on [5]) 

In VLSI design, cycle time is one of the key specifications because it affects
the timing performance. Improper clock skew could cause system malfunction.
The clock period depends on the worst-case path delay and, in a crucial way,
the clock skew. The H-tree clock routing is perhaps the most commonly used.
Early work on minimizing clock skew includes heuristics on balancing the wire
lengths [41, 42]. In paper [5], the problem is attacked head on with the aim of
obtaining zero skew. The author was successful in obtaining zero skew based
on computing delay using the Elmore delay model and a depth-first search
algorithm. Next, the paper uses a recursive bottom-up algorithm to balance the
delay at each junction based on the delay calculations of sub-trees. The paper
also includes a generalization to buffered RC trees, and provides numerical
examples to compare with earlier results based on balancing the path length.
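
The balancing step at one junction can be worked out from the Elmore model. Suppose two subtrees with delays t1, t2 and load capacitances c1, c2 are joined by a wire of total length `length`, with resistance `r_unit` and capacitance `c_unit` per unit length, and the tapping point lies at fraction x of the wire from subtree 1. Equating the two Elmore delays gives a closed-form x. The sketch below is derived here under these assumptions (each wire segment is modeled as a lumped resistance with half its capacitance at the far end); the function names and numbers are illustrative, not taken from [5].

```python
def elmore_branch_delay(t_sub, c_sub, wire_len, r_unit, c_unit):
    """Elmore delay from the tapping point through a wire segment into a subtree."""
    r_wire = r_unit * wire_len
    c_wire = c_unit * wire_len
    return r_wire * (c_wire / 2 + c_sub) + t_sub

def zero_skew_tap(t1, c1, t2, c2, length, r_unit, c_unit):
    """Fraction x of the wire (measured from subtree 1) at which both delays are equal."""
    a = r_unit * length
    return (t2 - t1 + a * (c2 + c_unit * length / 2)) / (a * (c_unit * length + c1 + c2))

# Hypothetical subtrees: subtree 1 is slower but lighter than subtree 2.
t1, c1, t2, c2 = 2.0, 1.0, 1.0, 2.0
length, r_unit, c_unit = 10.0, 0.1, 0.2

x = zero_skew_tap(t1, c1, t2, c2, length, r_unit, c_unit)
d1 = elmore_branch_delay(t1, c1, x * length, r_unit, c_unit)
d2 = elmore_branch_delay(t2, c2, (1 - x) * length, r_unit, c_unit)
print(x, d1, d2)
```

The tapping point moves toward the slower subtree so that both branch delays coincide. If the computed x falls outside the interval [0, 1], no point on the wire balances the delays, and the wire to the faster subtree has to be lengthened instead.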

Assuming one has solved the zero skew problem, it is natural to extend this
technique to control the exact time when each clocking element receives the
clock signal. With many of the designs pushing the performance limits of the
process technology, “retiming” by moving logic functions across flip-flop or
latch boundaries in order to meet clock cycle time has become a technique of
interest. However, moving logic functions across storage elements impacts
formal verification and test processes. A different but equivalent technique is to
skew the clock intentionally to accommodate the longer paths by borrowing the
allocated cycle time from upstream or downstream neighboring logic across the
storage elements. This is sometimes called “useful skew”. The useful skew
technique removes the impact on the formal verification and test processes at the
layout stage in a design process. It has generated some interest in the design
community. However, most designers view this technique as interesting but few have
actually used it in a real design. Those who have used it mostly apply it to very
small portions of a design.

There are two main concerns from chip designers about the zero skew or useful
skew technique: (1) the transistor parametric variation across the chip; (2) the
accuracy of delay models. Designers worry about the transistor behavior changes
due to the supply voltage variation and process parameter changes across the
chip. They also worry about the deep sub-micron effects on the accuracy of the
Elmore delay model. Using clock meshes and special clock buffers distributed in
a regular pattern across the chip to make sure nearby storage elements will not
have large clock skew is still a common practice. With the other considerations
described in Sec. 2.3, the clock distribution is no longer a standalone problem.
It is important to consider clock design in both the floorplanning and placement
stages and not just as a separate design task.

2.5 Goalie [6] 

Physical verification has been one of the first commercial applications
available from the EDA industry, since the early 1980s. It performs the most
important process design rule checks before a design is “taped out” to the
manufacturing facility. As the process technology advances, physical verification
tools have to address much larger designs with hundreds of millions of geometries.
Furthermore, the process rules are becoming very complex with many conditional
rules. Goalie is focused on memory-efficient algorithms to enhance the capacity of
the tool while maintaining flexibility for complex process rules.

Commercial physical verification tools have gone through several generations
of major development since Goalie. The most notable development is in the
area of hierarchical processing of a design instead of flattening all the design
data first. Since most modern layout designs are structured hierarchically and
made of a limited set of pre-designed cells, the new generation of physical
verification solutions as of 2002 has improved capacity and runtime by at
least two orders of magnitude by exploiting the repetitive nature of the designs.

A physical verification tool is fundamentally different from a physical design
tool. A verification tool focuses on accuracy, capacity, and speed of execution.
The correctness of a verification tool is most important. A physical design tool,
on the other hand, focuses on finding the optimal quality of result in terms of die
size, timing, and power, in addition to software capacity and runtime. The most
challenging aspect of physical verification has been “layout versus schematic”
(LVS). LVS is a process to verify that the final layout has faithfully implemented
the intended logic design by extracting a schematic from the layout and comparing
it to the original logic description. Although much advancement has been made in
geometry processing, it remains a challenging task for software to identify
the source of the LVS problem when a design mismatch happens.

3. Physical design and logic synthesis 

Traditionally, logic synthesis uses a “wireload” model to account for interconnect
delay during the synthesis process. The wireload model is a statistical model
extracted per technology process and per cell library to estimate the maximum wire length
for each signal. It was used successfully for many years until the early
1990s, when interconnect delay became dominant and caused major timing
inconsistencies between logic synthesis tools and physical design tools. Since
then, the wireload model has become the term everyone blames for timing closure
problems.

In the early 1990s, as the world of physical design moved to consider the timing
impact of interconnect-dominated delays, most efforts pursued a physical solution:
moving cells or blocks closer together and modifying routing topologies to accommodate
timing-critical paths. For those timing issues that cannot be fixed by physical
design techniques, feedback to the logic synthesis tools is necessary. The feedback
process from physical tool to logical tool involves long and tedious manual
work and, worse yet, in many cases it does not converge in timing, i.e. the logic
synthesis tool considers the timing problem corrected and passes the design to
the physical tool, but the physical tool still cannot find a feasible solution. Some
designs required over 30 iterations between physical design and logic synthesis
tools and still could not reach a solution after many weeks of trials.

A significant industry-wide effort to address this issue has been under way since
1994 at both physical design and logic synthesis EDA companies. This is what
many companies touted as a “SinglePass” or “RTL-to-GDS” solution: merging
logic synthesis and physical design into one process. Although this is a noble
goal, in reality the problem is far from solved even in 2002. Today,
the only successful convergence techniques combine incremental
buffering and cell sizing from traditional logic synthesis tools
with incremental placement and routing capabilities. This approach
eliminates many of the iterations between logic synthesis and physical design
tools. However, it performs only local and incremental optimizations. It does not
improve the quality of the original design structure. Its goal is to find a feasible
physical solution that does not deviate too much from the original goals set by
the logic design process. Furthermore, it is based solely on heuristics without
much theoretical foundation, and it suffers from severe capacity limitations and
extremely long runtimes. With the world moving to higher-complexity designs
and smaller geometries, this is clearly an area for new research. While this initial
solution is effective for local convergence, it is far from the
SinglePass or RTL-to-GDS solution originally envisioned. The industry needs to
move physical effects to higher levels of the logic synthesis process in order to generate
the best logic structures and the best matching physical design at the same time. Up
to this time, logic synthesis tools have not made much improvement in their technology.
In the years ahead, logic synthesis tools need significant fundamental
improvements in capacity, quality of result, and runtime. Eventually the logic
synthesis and physical design processes will be combined into one.

It is interesting to observe that while the EDA industry is investing heavily to
improve physical design tools to include incremental logic synthesis, there have
been very few research publications on this topic.
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Abstract 

An application of simulated annealing to floorplan design or macro placement is described. It 
uses a minimum size configuration space without invalid solutions, a constant object function, 
an adaptive control schedule, and an indicator for proper convergence. Fast convergence and 
improved flexibility are the salient features of this approach. 



1. Introduction 

Floorplan design and the context for which the algorithms of this paper were 
developed have recently been described [3]. Section 1.1. briefly summarizes 
that discussion, using the terminology of that paper. Simulated Annealing* was 
introduced into layout design by Kirkpatrick, Gelatt, and Vecchi [1]. In sec- 
tion 1.2 a slightly restricted version of that algorithm is described, together with 
some basic facts that can be found in literature, though somewhat scattered. The 
restrictions make an automatically adaptive schedule possible, and the described 
implementation can therefore be used in a silicon compiler. 

1.1 Floorplan design 

A floorplan is the topology (a set of neighbor relations) of a partition of a 
geometrical plane figure. In this paper the geometrical figure is a rectangle, and 
so are the elements of the partition. Such configurations are called rectangle 
dissections. In layout design floorplans are used to capture data at intermediate 
stages of the design. In that context the enclosing geometrical figure represents 
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a module, sometimes the whole system, and the elements of the partition rep- 
resent its submodules. With the restriction to rectangle dissections floorplans 
can be represented by polar graphs. The task of a floorplan design procedure is 
therefore to produce a suitable polar graph. 

The data available to a floorplan design procedure consists of data more or 
less directly derivable from the input for the layout design program, and envi- 
ronment data estimated in previous stages of the layout design. Among the first 
kind of data there is always a part that can be called ’proximity data’. It some- 
how indicates which modules are preferably kept close together. Very often this 
data is present in the form of an incidence structure called a net list. 

Also in this first group of data is the information about the shapes of the submodules:
$\Gamma$ assigns a shape constraint [3] to each module. Usually there is quite
some freedom in laying out these submodules. This is what makes floorplan de- 
sign in some sense a generalization of placement, where fixed objects have to get 
a location and an orientation. Though fixed shaped objects must be acceptable 
to a floorplan design method, it differs from classical placement algorithms in 
its capability of handling flexible shape objects as well. A previously designed 
environment may imply data such as a good shape for the whole module, and 
indications concerning the position of entry points of certain global nets. 

1.2 The annealing algorithm 

An annealing algorithm works on a state space, i.e. a set $S$ of states, on which
a topology is defined. Each state represents an encoding of a configuration. An
object function $\varepsilon : S \to \mathbb{R}$ assigns a real number to each state. This number is
interpreted as a quality indicator, in the sense that the lower this number the
better is the configuration that was encoded in that state. By defining a set $\mu$ of
neighbor relations over $S$ (i.e. $\mu \subseteq S \times S$) a topology is endowed to the state set
$S$. This relation is required to be symmetric and antireflexive, and its transitive
closure has to be $S \times S$. The elements of $\mu$ are called moves.

A selection probability $P : \mu \to (0, 1] \subset \mathbb{R}$ is assigned to each move. It satisfies

$$\sum_{s' \in s\mu} P((s, s')) = 1 \qquad (1)$$

and

$$P((s, s')) = P((s', s)). \qquad (2)$$

Another probability function, called the acceptance probability, depends not
only on the move, but also on a positive real number, the control parameter. This
function $\alpha : \mu \times \mathbb{R}^+ \to (0, 1]$ has the following properties:

$$\forall_{(s,s') \in \mu} \left[ \alpha((s,s'),t) \text{ is monotonous in } t \right] \qquad (3)$$

$$\varepsilon(s) < \varepsilon(s') \Rightarrow \lim_{t \to 0} \alpha((s,s'),t) = 0$$

$$\forall_{(s,s') \in \mu} \forall_{(s',s'') \in \mu} \forall_{t \in \mathbb{R}^+} \left[ (\varepsilon(s) < \varepsilon(s') < \varepsilon(s'')) \Rightarrow \alpha((s,s'),t)\,\alpha((s',s''),t) = \alpha((s,s''),t) \right] \qquad (7)$$

The following (infinite) Markov process uses the above probability functions 
to move from one state to another. 



    select present_state from $S$ at random;
    repeat
        select next_state from present_state $\mu$ according to $P$;
        if random([0, 1]) < $\alpha$((present_state, next_state), $t$)
            then present_state := next_state
    until false;



This process has a number of important properties:

P1. The relative frequency of steps with present_state $= s$ is

$$\delta(s,t) = \frac{\alpha((s_0,s),t)}{1 + \sum_{s' \in S \setminus \{s_0\}} \alpha((s_0,s'),t)} \qquad (8)$$

where $\varepsilon(s_0) = \min_{s \in S} \varepsilon(s)$. So, these relative frequencies are independent
of $\mu$. The function $\delta : S \times \mathbb{R}^+ \to [0, 1]$ is the equilibrium distribution.

P2. For a sufficiently high value of the control parameter $t$ the relative frequency
is the same for all states.

P3. For a sufficiently low value of the control parameter $t$, present_state is almost
exclusively a state with $\varepsilon \approx \varepsilon(s_0)$. If $s_0$ is the only state with $\varepsilon(s_0)$ (i.e.
$\varepsilon$ has a unique global optimum over $S$), present_state will be $s_0$ for an
arbitrarily large proportion of the time for $t$ low enough.
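The Markov process and property P3 can be illustrated with a small simulation. The sketch below is not the authors' implementation: it assumes a uniform selection probability over a toy ring-shaped state space and the exponential acceptance rule that the paper adopts later in (17). The names `metropolis_chain`, `neighbors`, and `energy` are hypothetical.

```python
import math
import random

def metropolis_chain(states, neighbors, energy, t, steps, rng):
    """Simulate the fixed-t Markov process and count the steps spent in each state."""
    present = rng.choice(states)
    visits = {s: 0 for s in states}
    for _ in range(steps):
        nxt = rng.choice(neighbors[present])      # uniform selection probability (assumption)
        delta = energy[nxt] - energy[present]
        # Assumed acceptance rule: always accept improvements, otherwise exp(-delta / t).
        accept = 1.0 if delta <= 0 else math.exp(-delta / t)
        if rng.random() < accept:
            present = nxt
        visits[present] += 1
    return visits

# Toy state space: a ring of six states with symmetric moves and one global minimum.
states = list(range(6))
neighbors = {s: [(s - 1) % 6, (s + 1) % 6] for s in states}
energy = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0, 4: 2.0, 5: 1.0}
visits = metropolis_chain(states, neighbors, energy, t=0.5, steps=20000,
                          rng=random.Random(1))
```

At this low value of t the chain spends most of its steps in state 0, the unique minimum, which is the behaviour P3 describes.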

Two aggregate functions are useful in characterizing the process. The first
one is the average value of $\varepsilon$ during the process:

$$E(t) = \langle \varepsilon \rangle = \sum_{s \in S} \delta(s,t)\,\varepsilon(s) \qquad (9)$$

The other parameter, called the entropy, is defined as

$$S(t) = -\sum_{s \in S} \delta(s,t) \ln \delta(s,t) \qquad (10)$$

It is very easy to calculate the entropy for high values of $t$, because of property
P2. Its value will be

$$S_\infty = \ln |S| \qquad (11)$$

If the system has only one global minimum the entropy will come arbitrarily
close to 0. If there are several global minima the entropy will approach

$$S_0 = \ln \left| \{ s \in S \mid \varepsilon(s) = \varepsilon(s_0) \} \right| \qquad (12)$$
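The two limits (11) and (12) can be checked numerically on a toy state space. The sketch below assumes that the equilibrium distribution has the exponential (Boltzmann) form that results from the acceptance function (17); the function names and the energy list are hypothetical.

```python
import math

def equilibrium(energies, t):
    """Equilibrium distribution, assuming delta(s, t) is proportional to exp(-eps(s) / t)."""
    e0 = min(energies)
    weights = [math.exp(-(e - e0) / t) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def average_energy(energies, t):
    """E(t): the equilibrium average of eps, as in (9)."""
    return sum(d * e for d, e in zip(equilibrium(energies, t), energies))

def entropy(energies, t):
    """S(t): minus the sum of delta * ln(delta), as in (10)."""
    return -sum(d * math.log(d) for d in equilibrium(energies, t) if d > 0.0)

# Five states, two of which are global minima.
energies = [0.0, 0.0, 1.0, 2.0, 3.0]
high_t_entropy = entropy(energies, 1e6)    # approaches ln 5, the number of states
low_t_entropy = entropy(energies, 0.01)    # approaches ln 2, the number of global minima
```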

The goal of the annealing algorithm is to find a state $s$ with $\varepsilon(s)$ close to $\varepsilon(s_0)$.
For a very low $t$ the process would be in such a state almost all the time after
reaching equilibrium, but it would take a huge number of steps to reach that situation.
The algorithm therefore performs the above process for several, in general
decreasing, values of $t$, each time of course with a limited number of steps. This
number should be large enough to approach the equilibrium characteristics. It
is relatively low for high values of $t$, and for going from an equilibrium situation at a
certain value of $t$ to one at a slightly different $t$. The decrements in $t$ and the
number of steps per value of $t$ are the parameters that characterize the schedule.
The schedule used in this application of annealing is described in section 4.

2. Problem formulation 

In this section the relation between the original problem, floorplan design, 
and the annealing algorithm is established. First the correspondence between 
the states of the annealing algorithm and floorplans is discussed. Then, in section 2.2,
the translation of floorplan ’quality’ into an $\varepsilon$ function is illustrated.

2.1 The state set 

Each state in S represents a feasible solution to the floorplan problem. It is 
in some sense an encoding of the floorplan configuration. This implies the exis- 
tence of an algorithm to derive an associated floorplan from each state. To keep 
the state space from being unnecessarily large, the encoding should preferably 
be bijective, i.e. each state represents exactly one floorplan. Also, the encod- 
ing should be easy to handle, in the sense that generating configurations from 
others (i.e. implementing the moves) and evaluating configurations can be done 
efficiently. 

Previous approaches [2] to floorplan design have shown the usefulness of 
point configurations as intermediate structures to carry the relevant topological 
as well as geometrical information. The points represent the submodules. Al- 
gorithms transforming such a point configuration into a floorplan while quite 
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accurately preserving these relative positions in the axes’ directions have been 
described. The distances between the points form geometrical data that can be 
used for evaluating the configuration. 

The task of the annealing algorithm will be the generation of such a point
configuration. Point configurations, however, are not the internal representation
of the states. That would lead to a huge state space even if the points were
required to take a limited number of positions (for example the vertices of a
grid). Instead it uses orderings implied by the point configurations. It uses one
sequence of modules for each dimension. The interpretation of $s$ is therefore a
function (or array) with $s_{axis}(i)$ being the module in position $i$ of the sequence of
the indicated axis.



2.2 The object function 



Many deterministic optimization algorithms depend heavily on the object 
function. The eigenvalue methods for floorplan design are examples of such 
algorithms. The annealing algorithm, however, is quite flexible with respect 
to the object function. To keep the discussion from becoming too abstract the 
implementation of one, rather specific, but frequently used, quality criterion, to- 
tal wiring length, will be described in this section. Also, to avoid a confusing 
amount of detail, the description is restricted to its simplest form, i.e. all mod- 
ules are flexible, no external nets, aspect ratio 1, no weights on the nets, the same 
weight on the directions, etc. The generalizations, however, are mostly straight 
forward. 

Unlike the deterministic optimization algorithms, annealing requires the object
function to be a good approximation of the real objective, because the result
will most likely be a state with an $\varepsilon$ close to $\varepsilon(s_0)$ but possibly quite different from
$s_0$. To estimate the length of the wires, geometrical rather than topological information
is needed. A point configuration based on the size of the submodules
is therefore derived from the sequences.



$$axis(m) = \tfrac{1}{2}\,\Gamma(m) + \sum_{i < s_{axis}^{-1}(m)} \Gamma(s_{axis}(i)) \qquad (13)$$
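A minimal sketch of how a point configuration might be derived from one sequence, reading (13) as placing each module's point at half its own size plus the total size of the modules that precede it in the sequence. That reading is an assumption, as are the names `axis_coordinates` and `size` (standing in for the value assigned by Γ).

```python
def axis_coordinates(sequence, size):
    """Coordinate of each module's point along one axis.

    sequence: modules in their order along this axis.
    size:     dict mapping module -> size (hypothetical stand-in for Gamma).
    """
    coords = {}
    offset = 0.0                               # total size of the preceding modules
    for module in sequence:
        coords[module] = offset + 0.5 * size[module]
        offset += size[module]
    return coords

size = {"a": 2.0, "b": 4.0, "c": 1.0}
x = axis_coordinates(["b", "a", "c"], size)
```

Running the same function on the second sequence gives the coordinates along the other axis.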



The length of a net is estimated by half the perimeter of the smallest rectangle 
enclosing all the points representing submodules connected to that net. This is 
a lower bound for the length of the Steiner tree in the plane with rectilinear dis- 
tances. This object function has to be evaluated many times during the annealing 
process. An efficient computation of the estimates of the net lengths for each s 
is therefore highly desirable. This efficiency is obtained by redoing only part of 
the computation and retrieving the rest from a precomputed data structure. This 
data structure is updated, rather than recomputed, after each move. It contains 
the range of the nets in both directions. So, for each net the first and the last
submodule in each sequence, sharing a pin with that net, are stored. The notation
is $f_{axis}(n)$ and $l_{axis}(n)$. When a move is made only the ranges of the nets that are
connected to one of the involved modules have to be updated.
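The half-perimeter estimate can be sketched as follows. For brevity this version recomputes every net range from scratch instead of updating a stored data structure after each move, so it shows the estimate but not the incremental bookkeeping described above. All names are hypothetical.

```python
def net_range(modules, position):
    """First and last module of a net along one axis, by position in the sequence."""
    first = min(modules, key=position.__getitem__)
    last = max(modules, key=position.__getitem__)
    return first, last

def half_perimeter(nets, seq_x, seq_y, x, y):
    """Sum over all nets of the half perimeter of the enclosing rectangle."""
    pos_x = {m: i for i, m in enumerate(seq_x)}
    pos_y = {m: i for i, m in enumerate(seq_y)}
    total = 0.0
    for modules in nets:
        fx, lx = net_range(modules, pos_x)
        fy, ly = net_range(modules, pos_y)
        total += (x[lx] - x[fx]) + (y[ly] - y[fy])
    return total

seq_x, seq_y = ["a", "b", "c"], ["c", "a", "b"]
x = {"a": 0.5, "b": 2.0, "c": 3.5}     # point coordinates along the x sequence
y = {"c": 0.5, "a": 1.5, "b": 2.5}     # point coordinates along the y sequence
nets = [["a", "c"], ["a", "b"]]
wire_length = half_perimeter(nets, seq_x, seq_y, x, y)
```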

If the sum of the net length estimates is chosen as the object function, a good
solution for one axis is also a good solution for the other axis, and if there
are only four global optima the corresponding point configurations have all the
points on a diagonal. To avoid such solutions, the sequences are forced to be
close to mutually orthogonal, which means that the correlation between the positions
in the two directions must be low.

$$\rho = \cdots - 3\,|\mathcal{M}|\,(|\mathcal{M}| + 1)^2 \cdots \qquad (14)$$

This leads to the following object function:

$$\varepsilon(s) = (1 + \rho) \sum_{n \in \mathcal{N}} \left[ x(l_x(n)) - x(f_x(n)) + y(l_y(n)) - y(f_y(n)) \right] \qquad (15)$$

3. The moves 

The topology of the state space is given by its moves. Although the topology 
does not influence the equilibrium distribution, it has a considerable effect on 
the convergence properties of the annealing process. 

3.1 Selection 

The effects of a good set of moves must be easy to compute, because many
moves will be tried in the annealing process. For good convergence all states
must be easy to reach, that is, it must be possible to reach any state from any other
state with a small number of moves. Yet the difference between the values of
the object function of two states connected by a move has to be relatively small.

The move used in this implementation is exchanging two modules. The num- 
ber of positions between those two modules plus one is called the length of a 
move. The maximum length of a move is controlled by a parameter L. For ob- 
ject functions like the one in (15) the change in e tends to grow with the move 
length. Decreasing the maximum length of the moves therefore mostly reduces 
the maximal difference in e of the two connected states, and thus increases the 
proportion of accepted moves, but it increases the diameter of the state space. 
For any given L all states have the same number of possible moves. If these 
moves are chosen with equal probability the requirements of (1) and (2) are satisfied.
The equilibrium is therefore not affected by a change in $L$ (this is not true
for changing $\varepsilon$ during the annealing). It also means that the entropy of the
configuration for high $t$ can be easily computed, since all states are equally probable
when $t$ is high enough:

$$S_\infty = 2 \ln(|\mathcal{M}|!) \qquad (16)$$
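A minimal sketch of the exchange move with a bounded length, together with the high-t entropy. Two assumptions are made here: the sequence is treated as linear (modules near the ends have fewer partners within the maximum length), and the state count is taken as (|M|!)^2, one ordering per axis, so that the entropy is 2 ln(|M|!). The names are hypothetical.

```python
import math
import random

def random_swap(sequence, max_length, rng):
    """Exchange two modules whose positions differ by at most max_length.

    Requires len(sequence) >= 2 and max_length >= 1.
    Returns the new sequence and the length of the move.
    """
    n = len(sequence)
    i = rng.randrange(n)
    while True:
        j = rng.randrange(n)
        if i != j and abs(i - j) <= max_length:
            break
    moved = list(sequence)
    moved[i], moved[j] = moved[j], moved[i]
    return moved, abs(i - j)

def entropy_at_high_t(num_modules):
    """2 ln(|M|!), computed through lgamma to avoid large factorials."""
    return 2.0 * math.lgamma(num_modules + 1)

moved, length = random_swap(list("abcdef"), 2, random.Random(0))
```

Because the reverse of a swap exchanges the same two positions, the move and its inverse are selected with the same probability, which is what the symmetry requirement (2) asks for.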
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3.2 Acceptance 

There are several functions that satisfy the requirements (3 - 6) and the pre- 
ceding sections do not imply any specific choice. The commonly adopted func- 
tion is 

$$\alpha((s,s'),t) = \min\left\{ 1, \exp\left( -\frac{\varepsilon(s') - \varepsilon(s)}{t} \right) \right\} \qquad (17)$$

This function has a number of properties that can be used advantageously in 
controlling the schedule. Firstly, it enforces a simple relation between E and S: 



$$\frac{dS}{dt} = \frac{1}{t}\,\frac{dE}{dt} = \frac{\sigma^2}{t^3} \qquad (18)$$

where $\sigma^2 = |\langle \varepsilon^2 \rangle - \langle \varepsilon \rangle^2|$. This makes it possible to monitor the entropy during
the annealing if a quasi-equilibrium is maintained during the whole process.
Another property that is a consequence of selecting (17) is that if $\varepsilon$ is normally
distributed over the states, then this normal distribution is not only realized in
equilibria for $t \to \infty$ but for all values of $t$ with a sufficient number of states
with $\varepsilon$ close to $E$. Moreover, $\sigma$ is the same for all these equilibria.
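The acceptance function can be written down directly and its chaining property (7) checked on a few numbers. The sketch assumes (17) is the usual exponential rule, min(1, exp(-(ε(s') - ε(s)) / t)); the function name and arguments are hypothetical.

```python
import math

def acceptance(e_from, e_to, t):
    """Probability of accepting a move from a state with value e_from to one with e_to."""
    if e_to <= e_from:
        return 1.0                     # improvements are always accepted
    return math.exp(-(e_to - e_from) / t)

# Chaining two uphill moves gives the same probability as the direct move.
chained = acceptance(1.0, 3.0, 2.0) * acceptance(3.0, 6.0, 2.0)
direct = acceptance(1.0, 6.0, 2.0)
```

The explicit branch for improvements also avoids evaluating a very large exponential when the difference is strongly negative and t is small.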

The average change in the object function, $\overline{\Delta\varepsilon}(l)$, can be tabulated as a function
of the length $l$ of the moves. The probability of accepting an $\varepsilon$-increasing move
is kept more or less constant during the process by controlling $L$:

$$\sum_{l=1}^{L} \cdots \qquad (19)$$



4. The schedule 

Information concerning the values of the control parameter for which the 
Markov-process of section 1.2 is simulated and for how long is called the sched- 
ule. It should specify the initial value of the control parameter, the decrements 
in that value, and when to stop the annealing. 



4.1 Initialization 

The initial probability of accepting the move with the biggest change in e 
must be reasonably high. By approximating this maximum change in the object 
function, an initial $t$ can be calculated for a given probability $p$:

$$t = -\frac{\max(\Delta\varepsilon)}{\ln p} \qquad (20)$$



Usually it is not difficult to approximate $\max(\Delta\varepsilon)$, but, if not done empirically,
the method depends on the problem. In the case of floorplan design with total
wiring length as object function, the maximum change over a move certainly is
smaller than

$$2 \times length(\text{longest axis}) \times \max_{\mathcal{M}}(\#\text{pins}). \qquad (21)$$

Since initial values for $E_\infty$ and $\sigma_\infty$ have to be calculated, the obtained initial $t$
can be checked against another condition, namely $t \gg \sigma_\infty$. This follows from
the requirement that $E(t)$ must initially be much closer to $E_\infty$ than the standard
deviation $\sigma_\infty$ in order to be able to reach all states easily, and the fact that for high
values of $t$, assuming a normal distribution for the $\varepsilon$(present_state), $E$ depends
on $t$ according to

$$E(t) = E_\infty - \frac{\sigma_\infty^2}{t} \qquad (22)$$

with $\sigma_\infty$ independent of $t$. This $\sigma_\infty$ is determined empirically by moving freely
through the state space, calculating $\varepsilon$ for each of the states visited, and at the
end setting $E_\infty$ to $\langle \varepsilon \rangle$ and $\sigma_\infty^2$ to $|\langle \varepsilon^2 \rangle - E_\infty^2|$. If the value of $t$ calculated with (20)
does not exceed $\sigma_\infty$ by a considerable amount, $t$ has to be increased.
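The initialization can be sketched in two small functions. Several things here are assumptions rather than the authors' procedure: the free walk through the state space is replaced by independent random samples, the initial t is obtained by solving exp(-max(Δε)/t) = p for t, and "a considerable amount" is expressed through a hypothetical `margin` factor.

```python
import math
import random

def estimate_infinite_t(energy, random_state, samples, rng):
    """Estimate the mean and standard deviation of eps over freely sampled states."""
    values = [energy(random_state(rng)) for _ in range(samples)]
    mean = sum(values) / samples
    mean_sq = sum(v * v for v in values) / samples
    sigma = math.sqrt(abs(mean_sq - mean * mean))
    return mean, sigma

def initial_t(max_delta, accept_prob, sigma, margin=10.0):
    """Initial control parameter: accept the largest change with probability
    accept_prob, but never start below margin * sigma (margin is hypothetical)."""
    t = -max_delta / math.log(accept_prob)
    return max(t, margin * sigma)

# Toy example: eps is 0 or 2 with equal probability, so the mean and sigma are both near 1.
e_inf, sigma_inf = estimate_infinite_t(lambda s: s, lambda r: r.choice([0.0, 2.0]),
                                       4000, random.Random(7))
```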



4.2 A stop criterion 



If the decrements in t are small enough to keep the process in quasi-equilibrium, 
a fairly general stop criterion can be used. It derives from the observation that 
the improvement possible by lowering t further must be much smaller than the 
improvement obtained by decreasing t stepwise from its initial value. The latter 
equals $E_\infty - E(t)$. An upper bound for the improvement still possible, if $t$ is in
the interval with a positive second derivative of $E$, is $t\,(dE/dt)$. Using (12) this
gives

$$\frac{\sigma^2}{t\,(E_\infty - E(t))} < \theta, \qquad (23)$$

$\theta$ being a very small positive number. Outside the interval the upper bound is
not valid, but the left hand side of (23) is close to 1 for high values of $t$, and
drops slowly until it enters that interval.

The above criterion is only reliable if the process was kept in quasi-equilibrium
during the annealing process. An indication for this condition is the value of $S$,
the entropy. The value of $S$ at high values of $t$ is close to $S_\infty$, which can be
calculated with (16). The decrements in $S$ can be calculated by using the relation in
(18). The value of $S$ when the stop criterion is satisfied should be close to $S_0$. If
the schedule is too fast the process is likely to get trapped in a local minimum,
for a while or forever. In the latter case $S$ will stay much too high. In the former
case $S$ will drop at too low a value for $t$, and consequently drops quickly and,
finally, below $S_0$. Of course, both phenomena can occur in the same process, and
$S$ may be close to $S_0$ when the stop criterion is satisfied, without convergence
to a global minimum.
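The stop test can be sketched as a small predicate. It assumes that the left hand side of (23) is σ² / (t (E∞ - E(t))), which follows from t (dE/dt) when dE/dt = σ²/t²; the default threshold and all names are hypothetical.

```python
def should_stop(t, e_inf, e_t, sigma_sq, theta=1e-3):
    """True when the improvement still possible is small compared with
    the improvement already obtained since the start of the annealing."""
    gained = e_inf - e_t                 # improvement obtained so far
    if gained <= 0.0:
        return False                     # nothing gained yet, keep annealing
    return sigma_sq / (t * gained) < theta

early = should_stop(t=100.0, e_inf=50.0, e_t=49.0, sigma_sq=100.0)   # ratio is 1
late = should_stop(t=0.1, e_inf=50.0, e_t=10.0, sigma_sq=1e-4)       # ratio is 2.5e-4
```

As the text notes, such a test is only meaningful if the entropy bookkeeping indicates that quasi-equilibrium was maintained.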
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4.3 Control 

The schedule must be controlled such that the process stays in quasi-equilibrium
and yet converges fast to a global optimum. This has to be achieved by
determining the decrements in $t$, and the number of moves per value for $t$.
Experience showed that the best results were obtained by keeping the number of
acceptances proportional to the change in the entropy. So if the number of
moves per selected $t$-value is constant, and big enough to obtain useful information
about $E(t)$ and $\sigma^2(t)$, the decrement in $t$ is completely determined up to
a constant factor:

$$\Delta t \propto \frac{t^3}{\sigma^2} \qquad (24)$$

The factor can be chosen such that the steps do not disturb the distribution function
too much. For example,

$$\frac{\delta(s,t)}{\delta(s,t - \Delta t)} < \gamma \qquad (25)$$

which for the higher values of $t$ is satisfied if

$$|\Delta t| < \frac{t^2 \ln\gamma}{\max\varepsilon + t \ln\gamma} \qquad (26)$$

From (24) and (26) an expression for the factor can be derived. Using the maximum
value of the object function is, however, over-pessimistic and leads to
unnecessarily slow schedules. In the program it is therefore replaced by



Notes 

1. C.D.Gelatt jr. et.al.: “Optimization of an organization of many discrete elements”; Patent to issue. 

2. This is the notation of [3]: $\mathcal{M}$ is the set of modules, $\mathcal{N}$ the set of nets, and $\mathcal{P}$ the set of pins. In this
paper $\Gamma$ simply assigns an area to a module.
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module ANNEAL; 

import …, E∞, Δεx, Δεy, σ∞, t, s;

const γ, c, η, θ;

begin s1 := s; k := -1;
    S∞ := 2|M| ln(|M|) … + ln(2π|M|);
    S0 := ln(# global minima);
    …
    repeat k := k + 1; E := …; S := S∞;
        Lx := Ly := |M| - 1;
        repeat e := 0; esq := 0;
            for h := 1 to |…| do
                axis := random(x, y);
                i := random(1, …, |M|);
                repeat j := random(1, …, |M|)
                until (i ≠ j) ∧ (|s1[i] - s1[j]| < L_axis);
                s2 := swap(s1[i], s1[j], axis);
                if random([0, 1]) < exp(…)
                    then s1 := s2;
                …
                if ε(s1) < ε(s) then s := s1 end
            end;
            …
            end;
            ΔE := e - E; E := e; σ² := |esq - E²|;
            L_axis := min({|M| - 1} ∪ {l | … exp(…) < c});
            S := S + …;
        until … / (t (E∞ - E(t))) < θ;
    until (S - S0) / (S∞ - S0) < η
end;



GOALIE: A SPACE-EFFICIENT SYSTEM 
FOR VLSI ARTWORK ANALYSIS 



Thomas G. Szymanski¹ and Christopher J. Van Wyk²
AT&T Bell Laboratories 
Murray Hill, NJ 07974 



Abstract 

This paper deals with the algorithmic foundations of the GOALIE artwork analysis system. 
GOALIE includes programs for boolean geometric operations, connectivity analysis, transistor 
extraction and measurement, circuit extraction, parasitic capacitance measurement, and design 
rule checking. One of our major results is showing how the expected main memory requirement
for all these tasks can be limited to $O(\sqrt{n})$, where $n$ is the number of edges in the artwork, while
still running at least as fast as previously published algorithms. GOALIE can therefore handle
large layouts on small computers, or even on personal workstations.



1. Introduction 

The first step in processing a layout with GOALIE is to flatten the layout’s 
hierarchical description (given, for example, in CIF [8]) into a set of files, each
of which represents the geometric regions on one mask layer. Boolean geometric 
operations are then used to remove overlaps between figures within a layer, find 
transistor channel regions, and derive the set of “wires” comprising the circuit. 
Connectivity analysis is then performed to assign electrical net numbers to each 
wire (or piece thereof) and transistor extraction is performed to determine the 
terminal nets and channel dimensions of each transistor. If desired, the parasitic 
capacitance of each net can also be determined and design rules can be checked. 

Each geometric region is represented by its edges, where an edge is described
by its endpoints, $(x_1, y_1)$ and $(x_2, y_2)$, with either $x_1 < x_2$ (non-vertical edges) or
$x_1 = x_2$ and $y_1 < y_2$ (vertical edges), along with an orientation field that indicates
on which side of the edge the region lies. An edge may also contain additional 
information such as the number of the electrical net to which it belongs. A set 
of regions is represented by an edge file, containing the edges of all regions in 
canonical order, viz., lexicographical order on (xi,yi) pairs, with ties broken 
by the edges’ slopes. Edge files are the basic inputs and outputs of the various 
operations available in the GOALIE system. 
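The canonical order can be sketched as a sort key. This is only an illustration of the ordering, not GOALIE's file format: edges are modeled as pairs of endpoints, vertical edges are given an infinite slope, and ties are broken by ascending slope. The direction of the tie-break and the function names are assumptions.

```python
import math

def canonical_key(edge):
    """Sort key: lexicographic on (x1, y1), ties broken by the edge's slope."""
    (x1, y1), (x2, y2) = edge
    slope = math.inf if x1 == x2 else (y2 - y1) / (x2 - x1)
    return (x1, y1, slope)

def canonical_order(edges):
    """Return the edges of an edge file in canonical order."""
    return sorted(edges, key=canonical_key)

edges = [((2, 0), (3, 0)), ((0, 0), (0, 5)), ((0, 0), (4, 4)), ((0, 0), (4, 0))]
ordered = canonical_order(edges)
```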



Authors are currently with ¹Avaya Labs Research and ²Drew University.
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2. Geometric Operations 

GOALIE performs boolean geometric operations such as union and intersection
using edge-oriented scanline algorithms as described, for example, in [6,
12]. These algorithms can be implemented to run in $O(n \log n)$ time and $O(\sqrt{n})$
expected space. We use the method described by us in [14] to sort the output
edges into canonical order, that is, the output edges are first written into a temporary
file that is then read backwards and passed through a priority queue to
produce the final output file. This strategy is significantly more efficient than
applying a general purpose sorting procedure and has the added advantage of
using only $O(\sqrt{n})$ expected space.

Our implementation can produce any boolean combination of two or more in- 
put edge files; moreover, GOALIE can produce several different boolean com- 
binations at once. For example, in an NMOS process one might need to find 
both the intersection and the difference between diffusion and polysilicon lay- 
ers to obtain the transistors and diffusion wires, respectively. GOALIE does 
both of these operations during a single boolean operation on the diffusion and 
polysilicon edge files. 

3. Connectivity Analysis 

A key algorithm used in GOALIE and recently described by us in [14] is 
a space efficient algorithm for connectivity analysis. Previous algorithms for 
this task [2, 4, 5, 7, 9, 10, 11, 13], have either kept all the edges of each net in 
main memory until the net is complete, or else gathered some “global topology” 
information in a table and used this information to renumber the edges in a 
separate pass. Both of these approaches take $O(n)$ space.

For expository purposes, let us first consider the simpler problem of region 
analysis, that is, the problem of assigning the same number to each edge of the 
same connected region of a single set of polygons. Our method begins with 
a standard scanline algorithm and augments it with a “union-find” data struc- 
ture [1] to maintain sets of output edges that have been discovered to belong 
to the same connected region in the half plane to the left of the scanline. As- 
sociated with each set of edges is a temporary region number. Whenever an 
input edge is discovered to be an output edge, it is placed in a new singleton set 
with a new temporary number. Whenever the scanline passes beyond an output 
edge, the output edge is written to the temporary file, tagged with the temporary 
number of its set. Whenever two output edges are discovered to touch, their 
corresponding sets are merged, a record is written to the temporary file giving 
the temporary numbers of the two sets involved, and one of the temporary numbers
is discarded. Finally, whenever the last remaining output edge in some set
is written out, a special record indicating the temporary number of the set is 
written to the temporary file. 
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This temporary file has exactly enough information to allow final region num- 
bers to be assigned to each output edge during the backward sorting pass de- 
scribed in Section 2. The union-find structure, the renaming table used during 
the sorting pass, and the priority queue used for sorting, all require $O(\sqrt{n})$ expected
space and $O(n \log n)$ time.
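The two-pass numbering scheme can be sketched on an abstract event stream. This is an interpretation of the description above, not GOALIE's code: the scanline is replaced by a hypothetical list of "new", "touch", and "out" events, the temporary file by an in-memory log, and no attempt is made to reproduce the space bound. All names are assumptions.

```python
def number_regions(events):
    """Assign a final region number to every output edge.

    events: ("new", edge), ("touch", edge_a, edge_b) or ("out", edge) tuples,
    in the order a scanline would produce them.
    """
    parent, live, tag_of, log = {}, {}, {}, []

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    # Forward pass: maintain the sets and write records to the log.
    for event in events:
        kind = event[0]
        if kind == "new":                      # new singleton set, new temporary number
            temp = len(parent)
            parent[temp] = temp
            live[temp] = 1
            tag_of[event[1]] = temp
        elif kind == "touch":                  # merge two sets, discard one number
            a, b = find(tag_of[event[1]]), find(tag_of[event[2]])
            if a != b:
                parent[b] = a
                live[a] += live.pop(b)
                log.append(("merge", a, b))    # (kept, discarded)
        elif kind == "out":                    # edge leaves the scanline
            root = find(tag_of[event[1]])
            log.append(("edge", event[1], root))
            live[root] -= 1
            if live[root] == 0:                # last remaining edge of its set
                log.append(("done", root))

    # Backward pass: read the log in reverse and rename temporary numbers.
    rename, final, next_final = {}, {}, 0
    for record in reversed(log):
        if record[0] == "done":
            rename[record[1]] = next_final
            next_final += 1
        elif record[0] == "merge":
            rename[record[2]] = rename[record[1]]
        else:
            final[record[1]] = rename[record[2]]
    return final

events = [("new", "e1"), ("new", "e2"), ("new", "e3"), ("touch", "e2", "e3"),
          ("out", "e3"), ("touch", "e1", "e2"), ("out", "e1"), ("out", "e2"),
          ("new", "e4"), ("out", "e4")]
final = number_regions(events)
```

Reading backwards works because every temporary number is either completed or discarded after it is last used as a tag, so its final number is already known when the tagged edge records are reached.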

It is straightforward to extend this method to perform connectivity analysis 
on several mask layers. We maintain a separate scanline for each layer while 
keeping all output edges in a common union-find structure. Edges are tagged 
with the name of their layer. For each contact window in canonical order, we 
advance all scanlines to the abscissa of the window, and then find an edge of 
the region surrounding the contact on each of the layers being connected. The 
sets containing these edges are then merged together and a record is written to 
the temporary file just as before. During the sorting pass the output edges are 
written to the output files that correspond to their layers. 

4. Transistor and Parasitic Capacitance 
Extraction 

In this section we shall describe how we extract and characterize transistors 
and parasitic capacitance from the artwork. We first need to introduce some 
terminology. A level is an edge file representing a set of geometric regions, each 
of which has an associated electrical net number. A level is said to be present 
at exactly those points in the plane that lie in the interior of some region in the 
level. Given k levels, the color of a point is the set of levels present at the point. 
We can therefore view the $k$ levels as partitioning the plane into regions of $2^k$
possible colors.

We have developed a structure called a region-oriented scanline that makes 
it convenient to process the regions generated by a set of levels. The structure 
makes it simple to find the net of each level that is present in a region. Similarly, 
the lengths of edges and the areas of regions are easily obtained. This informa- 
tion is associated with the bottommost edge of a region and is maintained by a 
finite automaton traversing the scanline. The automaton also makes it simple to 
gain access to the information associated with all abutting regions. 

Region-oriented scanlines can be used to extract transistor connectivity as fol- 
lows: pass the region-oriented scanline over edge files containing diffusion and 
polysilicon regions (with previously assigned net numbers), transistor channels, 
and any relevant implants. For each transistor region, look for the net numbers 
on its covering polysilicon and on abutting diffusion. The record produced for 
a transistor includes the gate, source, and drain signals, any implant layers that 
are present, its area and its perimeter with diffusion. (From the last two num- 
bers the channel length and width can be computed for rectangular transistors.) 
This version of transistor identification and interconnection is simpler and more
robust than the method we described earlier [14]. 
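
The parenthetical remark about rectangular transistors can be made concrete. The sketch below assumes a rectangular channel whose source and drain each abut the channel along one full edge of length W, so that the area is W times L and the perimeter with diffusion is 2W. The function name is hypothetical.

```python
def channel_dimensions(area, diffusion_perimeter):
    """Return (length, width) of a rectangular transistor channel.

    Assumes area = width * length and diffusion_perimeter = 2 * width,
    i.e. source and drain each abut one full channel edge.
    """
    width = diffusion_perimeter / 2.0
    length = area / width
    return length, width


# A channel of area 12 with 12 units of perimeter abutting diffusion.
print(channel_dimensions(12.0, 12.0))
```

For non-rectangular channels (for example, bent gates) these two numbers do not determine the geometry, which is why the source restricts the statement to rectangular transistors.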

The parasitic capacitance of an electrical net is approximated as a function of 
the net’s area and its perimeter abutting regions of different colors. By passing a 
region-oriented scanline over edge files containing the conducting layers and any 
relevant implants, we can find and report for each region its area, what electrical 
nets are carried on each layer that is present in it, and what its perimeters are 
with regions of other colors. The detailed description of regions output from
this step makes precise capacitance extraction possible. For example, consider a
diffusion runner passing through a buried contact region, as shown in Figure 1.

Figure 1. A diffusion runner under buried contact implant.

Figure 2. Areas and edges of the layout in Figure 1.

Our approach would gather the following statistics for the regions shown in
Figure 2:

• Region B contains only diffusion; its perimeter with empty space is the
sum of the lengths of edges 4, 7, and 11; its perimeter with both diffusion
and buried contact implant is edge 8.

• Region C contains diffusion and buried contact implant; its perimeter with
diffusion only is the sum of the lengths of edges 8 and 9; its perimeter with
buried contact implant is the sum of the lengths of edges 5 and 12.

• Region D contains only diffusion; its perimeter with empty space is the
sum of the lengths of edges 6, 10, and 13; its perimeter with both diffusion
and buried contact implant is edge 9.

In calculating the capacitance of the diffusion runner, we could use a different 
coefficient for the areas of regions B and D than for C. We could also count B’s 
perimeter at edge 8 differently than that of its other edges, and C’s perimeter at 
edges 5 and 12 differently from that at 8 and 9. 
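
A minimal sketch of how such per-color coefficients might be applied to the region records is given below. The record layout, the function name, and every area, length, and coefficient are hypothetical values chosen for illustration; they are not taken from Figure 2 or from any process technology.

```python
DIFF = frozenset({"diffusion"})
BOTH = frozenset({"diffusion", "buried_contact"})
IMPLANT = frozenset({"buried_contact"})
EMPTY = frozenset()


def net_capacitance(regions, area_coeff, perimeter_coeff):
    """Sum area and perimeter contributions over the regions of one net.

    Each region record carries its color, its area, and its perimeter
    lengths keyed by the color of the abutting region.
    """
    total = 0.0
    for region in regions:
        color = region["color"]
        total += area_coeff[color] * region["area"]
        for neighbor_color, length in region["perimeters"].items():
            total += perimeter_coeff[(color, neighbor_color)] * length
    return total


# Hypothetical records for regions B, C and D of the diffusion runner.
regions = [
    {"color": DIFF, "area": 8.0, "perimeters": {EMPTY: 10.0, BOTH: 2.0}},
    {"color": BOTH, "area": 4.0, "perimeters": {DIFF: 4.0, IMPLANT: 4.0}},
    {"color": DIFF, "area": 8.0, "perimeters": {EMPTY: 10.0, BOTH: 2.0}},
]
# Hypothetical coefficients: a different area coefficient under the implant,
# and no sidewall contribution where diffusion continues across an edge.
area_coeff = {DIFF: 1.0, BOTH: 2.0}
perimeter_coeff = {
    (DIFF, EMPTY): 0.5,
    (DIFF, BOTH): 0.0,
    (BOTH, DIFF): 0.0,
    (BOTH, IMPLANT): 0.25,
}

print(net_capacitance(regions, area_coeff, perimeter_coeff))
```

Keying the perimeter coefficient by the pair of colors is what lets edge 8 of region B be weighted differently from its other edges.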

Let us consider the space requirements for these operations. A transistor can
be written out to disk as soon as the scanline sweeps beyond its channel region.
Since the expected number of transistors that cross any scanline is $O(\sqrt{n})$, the
working space for transistor extraction is $O(\sqrt{n})$. Similarly, the area and
perimeter statistics for a region can be output when the region ends; the expected
number of regions that cross any scanline is $O(\sqrt{n})$, so the expected working
space for parasitic capacitance extraction is $O(\sqrt{n})$.

The output from parasitic capacitance extraction consists of a separate record 
for each region. We use caching and data compression techniques to reduce the
size of this file, which typically occupies the same amount of disk space as the
edge files for the connectable levels of the layout. Various postprocessors are
available for converting this file into whatever form is required for simulators 
and other programs. 

5. Design Rule Checking 

In this section we discuss the algorithm used in GOALIE for checking that 
the boundaries of geometric regions are separated by some minimum clearance. 
This algorithm is used for both clearance and enclosure checking. It supports 
both the Manhattan and Euclidean metrics, a property not shared by algorithms 
that check for clearance violations by growing regions and checking for inter- 
sections. Indeed, the algorithm even supports non-uniform metrics where the 
required clearance between two regions depends on the slopes of their facing 
edges. 

For simplicity assume that the input to the algorithm consists of a set of
disjoint polygonal regions represented by a single edge file with electrical net
numbers attached to the edges. Without loss of generality, let the tolerance t be 1
and let all edges lie within a bounding box of height H and width W whose lower left
corner lies at the origin. Define swath i to consist of all points in the plane whose
abscissa x satisfies $i \le x < i+1$.

Our approach is to check, for each endpoint p of every edge, whether another
edge with a different net number lies within distance 1 of p. It suffices for any
endpoint in swath i to consider only those edges that cross swath i, or have an
endpoint in swaths i - 1 or i + 1. The separation checking algorithm appears in
Figure 3.

for i ← 0 step 1 until W do
    insert into S all edges whose left endpoint is in swath i + 1;
    for each edge e with an endpoint p in swath i do
        check endpoint p of e against the edges in S;
    endfor;
    remove from S all edges whose right end is in swath i - 1;
endfor;



Figure 3. Clearance checking algorithm. 
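
The sweep of Figure 3 can be sketched in Python as follows. This is an illustration under several assumptions, not GOALIE's implementation: S is a plain set rather than the tree structure described next, edges whose left endpoint lies in swath 0 are inserted before the loop starts, the Euclidean metric is used, and a separation strictly less than 1 counts as a violation. All names are hypothetical.

```python
import math
from collections import defaultdict


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def clearance_violations(edges, width):
    """edges: list of (net, (x0, y0), (x1, y1)) with x0 <= x1; tolerance 1."""
    by_left, by_right, by_endpoint = (defaultdict(list) for _ in range(3))
    for idx, (_, a, b) in enumerate(edges):
        by_left[math.floor(a[0])].append(idx)
        by_right[math.floor(b[0])].append(idx)
        by_endpoint[math.floor(a[0])].append((idx, a))
        by_endpoint[math.floor(b[0])].append((idx, b))

    active = set(by_left[0])          # assumption: swath 0 is pre-inserted
    violations = set()
    for i in range(width + 1):
        active.update(by_left[i + 1])
        for idx, p in by_endpoint[i]:
            net = edges[idx][0]
            for other in active:
                other_net, a, b = edges[other]
                if other_net != net and point_segment_distance(p, a, b) < 1.0:
                    violations.add((min(idx, other), max(idx, other)))
        active.difference_update(by_right[i - 1])
    return sorted(violations)


edges = [
    (1, (0.0, 0.0), (5.0, 0.0)),   # net 1
    (2, (2.0, 0.5), (4.0, 0.5)),   # net 2, only 0.5 above net 1
    (3, (0.0, 3.0), (5.0, 3.0)),   # net 3, far from both
]
print(clearance_violations(edges, width=5))
```

Because every test is an explicit point-to-edge distance, swapping in a Manhattan or slope-dependent metric only requires replacing `point_segment_distance`.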



The structure S is a variant of the “segment tree” structure described in [3].
S is a binary tree in which each node is associated with a subinterval of [0, H]
and with a list of edges. All intervals are closed on the left and open on the
right, and are of the form $[2^j i, 2^j (i+1))$ for some non-negative integers i and
j. If i is even, then the nodes associated with the intervals $[2^j i, 2^j (i+1))$ and
$[2^j (i+1), 2^j (i+2))$ are the children of the node associated with $[2^j i, 2^j (i+2))$.
For example, the interval [16, 32) is the parent of intervals [16, 24) and [24, 32).
An edge is stored in S by putting it on the list associated with the lowest node
whose interval includes the ordinates of both of the endpoints of the edge. For
example, the edge from $(x_1, 18)$ to $(x_2, 29)$ would be attached to the list on the
node whose interval is [16, 32).

To check an endpoint p = (x, y) of an edge e against the edges in S, it suffices
to check only those edges on the lists of nodes whose intervals intersect the
interval [y - 1, y + 1]. These nodes may be found either on or immediately
adjacent to the path from the leaf containing y to the root. Some of the details
are shown in Figure 4.



let l be the leaf of S whose interval contains y;
while l is not the root of S do
    check p of e against all edges on the list of l;
    check p of e against all edges on the list of l’s left neighbor;
    check p of e against all edges on the list of l’s right neighbor;
    l ← parent of l;
endwhile;



Figure 4. Algorithm for checking edge e against S.
At this point, let us consider a number of implementation issues. First of all,
the nodes of S can be organized in a “heap” data structure [1] and indexing can 
be used instead of pointers for moving up, down, or sideways in the tree. Sec- 
ond, the algorithm supports a variety of metrics because all tests for clearance 
violations are explicitly performed on pairs of edges. If the required distance be- 
tween edges is independent of their slopes, then the speed of the algorithm can 
be doubled by observing that any endpoint of an edge is actually the endpoint of 
at least two edges of the associated region. The process of checking endpoint p 
of edge e against S in Figure 4 need only be performed for the bottom endpoint 
of a left edge, the left endpoint of a top edge, the top endpoint of a right edge, 
and the right endpoint of a bottom edge. Third, the checks against the lists of 
neighbors mentioned in Figure 4 can be suppressed whenever the interval asso- 
ciated with that neighbor is further than 1 from y. Indeed, at the non-leaf nodes 
of S, this will always be the case with at least one of the neighbors. 
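
The structure S with heap-style indexing can be sketched as below. The class and method names are hypothetical, and the sketch makes three assumptions that the text does not state: H is a power of two, edge ordinates lie in [0, H), and the list on the root is also inspected after the loop of Figure 4 finishes (the figure as printed stops before the root). The query returns candidate edges only; the explicit clearance test on each candidate is left to the caller.

```python
class EdgeTree:
    """Heap-indexed tree over ordinates [0, height); node 1 is the root."""

    def __init__(self, height):
        if height < 1 or height & (height - 1):
            raise ValueError("this sketch needs a power-of-two height")
        self.height = height
        # Nodes 1 .. 2*height - 1; the leaves are height .. 2*height - 1.
        self.lists = [[] for _ in range(2 * height)]

    def interval(self, node):
        """Half-open interval [low, high) covered by a node."""
        depth = node.bit_length() - 1
        width = self.height >> depth
        low = (node - (1 << depth)) * width
        return low, low + width

    def _home(self, y0, y1):
        """Lowest node whose interval contains both ordinates."""
        a, b = self.height + int(y0), self.height + int(y1)
        while a != b:
            a //= 2
            b //= 2
        return a

    def insert(self, edge):
        (_, y0), (_, y1) = edge
        self.lists[self._home(y0, y1)].append(edge)

    def remove(self, edge):
        (_, y0), (_, y1) = edge
        self.lists[self._home(y0, y1)].remove(edge)

    def _near(self, node, y):
        low, high = self.interval(node)
        return low <= y + 1 and high > y - 1

    def candidates(self, y):
        """Edges stored on nodes whose interval meets [y - 1, y + 1]."""
        node = self.height + int(y)
        found = []
        while node > 1:
            found.extend(self.lists[node])
            left, right = node - 1, node + 1
            # A power-of-two index is leftmost on its level, and an index
            # one below a power of two is rightmost on its level.
            if node & left and self._near(left, y):
                found.extend(self.lists[left])
            if right & node and self._near(right, y):
                found.extend(self.lists[right])
            node //= 2
        found.extend(self.lists[1])   # root list, added in this sketch
        return found


tree = EdgeTree(32)
edge = ((3, 18), (7, 29))             # the example edge from the text
tree.insert(edge)
print(tree.interval(tree._home(18, 29)))
print(edge in tree.candidates(15.5), edge in tree.candidates(5))
```

The `_near` test is the suppression mentioned above: a neighbor is skipped whenever its interval lies further than 1 from y.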

Let us next consider the resource requirements of the algorithm. Let n be
the number of edges in the file being checked. Certainly the expected number
of edges cut by any line across the design is $O(\sqrt{n})$. Moreover, the expected
number of edges cut by any swath of the plane is also $O(\sqrt{n})$ where the constant
of proportionality is a function of the tolerance t, the layout methodology, and
the process technology. Thus the total number of edges held in S at any instant
is $O(\sqrt{n})$. Finally, we argue that both H and W have an expected value that is
$O(\sqrt{n})$. Since there are exactly 2H - 1 nodes in S, the total storage required by
the algorithm is $O(\sqrt{n})$.

It should be clear that the height of S is $O(\log n)$ and that an upper bound on
the running time of the algorithm is $O(bn \log n)$ where b is the maximum length
of any edge list associated with a node of S. We claim that the expected value of
b is $O(1)$. We shall assume at this point that the design is free of any “minimum
feature size” violations. Since each leaf l of S corresponds to a 3 x 1 rectangle
in the plane, and all edges of l pass through this rectangle, it is clear that a leaf’s
edge list can contain at most some constant number of edges without generating
a small feature. Similarly, at a given instant, all the vertical edges associated
with an interval $[2^j i, 2^j (i+1))$ intersect the line $y = 2^{j-1}(2i+1)$ along some
horizontal interval of width 3. Thus the number of vertical edges in a node is
also bounded by a constant. We have therefore accounted for all edges except 
diagonal ones. As it turns out, the number of diagonal lines in a single node can 
grow arbitrarily large if the geometry contains a large number of long diagonal 
edges. Although we have never seen this happen in practice, the problem can be 
fixed by simply storing long diagonal edges in a separate data structure. Further 
details will appear in a subsequent paper. 
6. Implementation and Performance

GOALIE is technology-independent and handles geometry at all angles,
although our implementation treats Manhattan edges more efficiently than
non-Manhattan edges. GOALIE consists of about 8000 lines of C code and has been
used in the verification of several dozen chips at Bell Laboratories.

In Figure 5 we present some typical performance data obtained on a DEC 
VAX 11/780. Time used is shown in hours, minutes, and seconds, while the 
maximum main memory used is shown in kilobytes. Because the four sample 
chips use four different technologies, the number of operations needed to ex- 
tract the circuit varied from chip to chip. Both NMOS and CMOS technologies 
are represented, and the design methodologies ranged from hand-layout to au- 
tomatically routed standard cells. The same chips have also been processed by 
GOALIE on an IBM 3081, a DEC VAX/750, and a SUN workstation. The space 
requirements remain about the same and the running times change by factors of 
about 0.1, 1.6, and 1.5 respectively. 



chip   number of     number of            percentage            number of
       transistors   nonvertical edges    nonmanhattan edges    metal edges

1       7,666        177,407              0.4%                   56,683
2      17,036        627,153              0.3%                  174,251
3      28,792        726,297              3.7%                  186,580
4      27,223        929,446              6.8%                  287,433


       determine transistor list    compute parasitic     check metal
                                    capacitance           spacing
chip   ops   time      space        time      space       time   space

1      3     11:25     191          4:21      309         0:57    99
2      7     58:37     255          17:07     331         3:01   223
3      5     1:38:22   442          36:51     543         3:11   179
4      5     2:53:15   400          1:22:48   623         4:56   342


Figure 5. GOALIE Performance Data.



The largest chip of which we are aware that GOALIE has extracted has
137,000 transistors. The extraction took under ten hours on a DEC VAX 11/780
with three megabytes of main memory and an operating system that does not
page; the amount of memory actually used is not available to us.
Abstract

A new placement method for cell-based layout styles is presented. It is composed of alternating 
and interacting global optimization and partitioning phases. In contrast to other methods using 
the divide-and-conquer paradigm, it maintains the simultaneous treatment of all cells during op- 
timization over all levels of partitioning. In the global optimization phases, constrained quadratic 
optimization problems are solved having unique global minima. Their solutions induce the as- 
signment of cells to regions during the partitioning phases. For general-cell circuits, a highly 
efficient exhaustive slicing procedure is applied to small subsets of cells. The designer may 
choose a configuration from a menu to meet his requirements on chip area, chip aspect ratio and 
wire length. Placements with high area utilization are obtained within low computation times. 
The method has been applied to general-cell and standard-cell circuits with up to 3000 cells and 
nets. 



1. Introduction 

A good placement is the major prerequisite for successful routing and effec- 
tive use of chip area. 

The complexity of the VLSI placement problem has forced the use of algorithms 
based on the divide-and-conquer paradigm. An important representative is the 
min-cut method based on graph-partitioning (e.g. [9]). But since min-cut al- 
gorithms are iterative improvement heuristics [8, 6], they depend on the initial 
partition. To get a good solution it might be necessary to select one partition 
computed from many randomly generated starting partitions [12]. 

Recently, algorithms have been studied that prefer another approach. The idea 
is to model the placement problem as a linear or nonlinear optimization prob- 
lem. Usually no starting solution is needed and all modules (cells) are treated 
simultaneously. Among these approaches are methods using physical (force or 
electrical network) analogies [1, 17, 3, 7, 16] and eigenvector methods [13, 2, 5]. 



*This work was partially supported by Deutsche Forschungsgemeinschaft (DFG) under grant An 125/5-2 
Authors are currently with Infineon Technologies AG and ^Technical University of Munich, Institute of 
Electronic Design Automation. 
Some of these methods apply partitioning to recursively create smaller
subproblems. However, they abandon simultaneous optimization after the initial step.
In the following, a new placement method called GORDIAN is introduced. It has
the unique feature of maintaining simultaneity over all optimization steps.

In section 2, we give an outline of the procedure. The ingredients of the pro- 
posed method are presented in sections 3 through 5. Space and time complexity 
is discussed in section 6. Results for standard-cell and general-cell circuits are 
presented in section 7. 

2. Outline of the Procedure 

The placement procedure GORDIAN is composed of alternating global opti- 
mization and partitioning phases. 

To illustrate the idea, suppose that all modules have been placed and a binary
slicing tree has been created whose leaf regions contain m modules. A region
represents a rectangular area of the chip together with the set of modules placed
in this area. The position of each module can be described in terms of the
coordinates of the leaf region containing it. The center of a region p is denoted
by $c_p = (x_p, y_p)$. On the $\ell$-th level of partitioning there are $q \le 2^\ell$ regions. The
regions of level $\ell$ are the sons of $q/2$ father regions on level $\ell - 1$, each having
a center

$c_p = (F_{p'} c_{p'} + F_{p''} c_{p''}) / F_p$,

where $F_p$ is the area of region p, and p' (p'') is the
left (right) son of p. Thus, the center of the root region of the slicing tree can be
determined recursively from its leaf regions.
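
The bottom-up recursion can be sketched as follows, assuming that the center of a father region is the area-weighted mean of the centers of its two sons (which holds whenever a rectangle is dissected into two rectangles). The dictionary layout and the function name are hypothetical.

```python
def center(region):
    """Return (area, (x, y)) of a region of a binary slicing tree.

    A leaf holds "rect": (x0, y0, x1, y1); an inner region holds
    "sons": [left, right]. The father center is the area-weighted
    mean of the two son centers.
    """
    if "sons" not in region:
        x0, y0, x1, y1 = region["rect"]
        return (x1 - x0) * (y1 - y0), ((x0 + x1) / 2, (y0 + y1) / 2)
    (a1, (cx1, cy1)), (a2, (cx2, cy2)) = (center(s) for s in region["sons"])
    area = a1 + a2
    return area, ((a1 * cx1 + a2 * cx2) / area, (a1 * cy1 + a2 * cy2) / area)


# A 4 x 2 chip cut vertically at x = 1.
root = {"sons": [{"rect": (0, 0, 1, 2)}, {"rect": (1, 0, 4, 2)}]}
print(center(root))
```

The result agrees with the geometric center of the undivided 4 x 2 rectangle, as expected.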

To create a placement from scratch, the centers of the regions have to be spec- 
ified in reverse order from the root to the leaves together with the centroids* of 
their modules. For the root region, one constraint is imposed on all modules. 
A global optimization is performed with this constraint forcing the centroid of 
the modules to the chip’s center. Then, the set of modules is bipartitioned and 
two subregions are created by dissecting the root rectangle. Their centers im- 
pose two new constraints on the modules replacing the single constraint of the 
root region. Then, the next global optimization is carried out. These global op- 
timization and partitioning phases are repeated until each module is assigned to 
its own region, step by step refining the placement of the modules (see fig. 1). 



*The centroid of a set of modules is the area-weighted mean of their center coordinates.
Figure 1. Stepwise placement refinement.



More formally, the procedure can be stated as follows: 

Procedure Gordian
    level l := 0;   /* the root region */
    while (level l contains a region p to be partitioned)
        globally optimize level l;
        for (each region p to be partitioned)
            partition region p into subregions p' and p";
            generate new constraints for p' and p";
        endfor
        l := l + 1;
    endwhile

3. Global Optimization 

In each global optimization phase, a constrained quadratic optimization prob- 
lem (CQOP) is derived from the circuit topology (the netlist) and the geometry 
of the rectangle dissection on the respective level. The solution of the CQOP is a 
global placement of all modules that induces the assignment of module subsets 
to subregions to be created in the next partitioning phase. 

Let us first introduce some definitions: The topology is described by the
binary relation $\mathcal{T} \subseteq \mathcal{N} \times \mathcal{M}$, where $\mathcal{N}$ and $\mathcal{M}$ are the index sets of the nets and
the modules, resp., with $N = |\mathcal{N}|$ and $M = |\mathcal{M}|$. A connection of net $\nu$ to
module $\mu$ is represented by $(\nu, \mu) \in \mathcal{T}$. By $\mathbf{T}$ we denote the topology matrix

$\mathbf{T}_{(N \times M)} = [t_{\nu\mu}]$, with $t_{\nu\mu} = 1$ if $(\nu, \mu) \in \mathcal{T}$ and $t_{\nu\mu} = 0$ otherwise.

The coordinates of the modules and the nets are $(x_\mu, y_\mu)$ and $(x_\nu, y_\nu)$, respectively.

The objective function $\phi$ of the CQOP is the weighted sum of the squared
distances between modules and nets:

$\phi = \frac{1}{2} \sum_{\nu \in \mathcal{N}} \sum_{\mu \in \mathcal{M}} [ (x_\mu + \xi_{\nu\mu} - x_\nu)^2 + (y_\mu + \eta_{\nu\mu} - y_\nu)^2 ] \cdot t_{\nu\mu} \cdot w_\nu$ ,   (1)

where $(\xi_{\nu\mu}, \eta_{\nu\mu})$ are the coordinates of a pin relative to the center of its module
$\mu$, and $w_\nu$ is the weight of net $\nu$ specified by the designer. Since

$\phi = \phi(x) + \phi(y)$ ,   (2)

we consider $\phi(x)$ synonymous for either term of (2). Writing (1) in matrix form,
we get

$\phi(\mathbf{x}) = \frac{1}{2} \mathbf{x}' \mathbf{C} \mathbf{x} + \mathbf{d}' \mathbf{x} + \mathrm{const.}$   (3)

with the coordinate vector $\mathbf{x}' = [\mathbf{x}_N', \mathbf{x}_m']$ of the N nets and the m movable
modules, and $\mathbf{C}$ the positive definite connectivity matrix. The vector $\mathbf{d}$ originates
from the contributions of the M - m fixed modules (the pad cells) and the
relative pin coordinates.

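The centers of the q regions on the $\ell$-th level of partitioning form the constraints
on the m movable modules

$\mathbf{A}^{(\ell)} \mathbf{x}_m = \mathbf{b}^{(\ell)}$ ,   (4)

with the vector $\mathbf{b}^{(\ell)}$ containing the center coordinates of the regions. The entries
of $\mathbf{A}^{(\ell)}$ describe which module $\mu$ (occupying $F_\mu$ units of area) belongs to which
region p:

$a_{p\mu} = F_\mu / \sum_{\mu' \in \mathcal{M}_p} F_{\mu'}$ if $\mu \in \mathcal{M}_p$, and $a_{p\mu} = 0$ otherwise.   (5)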



Putting together the topological (3) and the geometric (4) demands, the problem
to be solved is

$\min \{ \phi(\mathbf{x}) = \frac{1}{2} \mathbf{x}' \mathbf{C} \mathbf{x} + \mathbf{d}' \mathbf{x} \mid \mathbf{A}^{(\ell)} \mathbf{x}_m = \mathbf{b}^{(\ell)} \}$.   (6)

Since $\phi(\mathbf{x})$ is a convex function ($\mathbf{C}$ is positive definite) and the linear equality
constraints (4) are convex too, the problem (6) has the unique global minimum
$\phi(\mathbf{x}^*)$.
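
For a very small instance, an equality-constrained quadratic problem of this form can be solved directly from its optimality conditions, C x + d + A' λ = 0 together with A x = b. The sketch below does exactly that with a dense Gaussian elimination. It is only an illustration of why the minimum is unique and easy to characterize; it is not the iterative scheme the authors favor, it ignores sparsity, and all names and numbers in it are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    z = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * z[k] for k in range(row + 1, n))
        z[row] = (aug[row][n] - tail) / aug[row][row]
    return z


def solve_cqop(C, d, A, b):
    """Minimize 0.5 x'Cx + d'x subject to A x = b.

    Builds the system [[C, A'], [A, 0]] [x; lam] = [-d; b] and returns x.
    """
    n, q = len(C), len(A)
    kkt = [list(C[i]) + [A[j][i] for j in range(q)] for i in range(n)]
    kkt += [list(A[j]) + [0.0] * q for j in range(q)]
    rhs = [-di for di in d] + list(b)
    return solve_linear(kkt, rhs)[:n]


# Two coupled coordinates whose equally weighted centroid is forced to 3.
C = [[2.0, -1.0], [-1.0, 2.0]]       # positive definite
d = [-1.0, -4.0]
A = [[0.5, 0.5]]
b = [3.0]
print(solve_cqop(C, d, A, b))
```

Without the constraint the minimum of this toy problem lies at (2, 3); the centroid constraint shifts both coordinates while preserving their relative order, which is the behavior the partitioning phase relies on.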

The advantage of simultaneous optimization over the optimization in local mod- 
ule subsets is motivated by the fact that modules influence each other, even if 
they belong to distant regions. Thus, the placement of the whole circuit is opti-
mized in its entirety. 

To solve problem (6), various methods can be applied like a fixed point iteration
scheme similar to the one described in [7] or other optimization algorithms [11].
In this context, iterative solution methods outperform direct solvers, since the
system matrix $\mathbf{C}$ has to be set up only once for all CQOPs and sparsity is fully
exploited. But most important, the result of the CQOP of level $\ell$ is a good
starting solution for the CQOP of level $\ell + 1$. Furthermore, the increasing number
q of constraints in (4) more and more determines the result of the optimization.
Therefore, the number of iterations needed to solve (6) decreases rapidly with
higher levels $\ell$.

4. Partitioning 

After each global optimization a new partitioning phase is started. For each
region p with $|\mathcal{M}_p| \ge 2$, $\mathcal{M}_p$ is sorted according to the x- (y-) coordinates of its
modules, if the region will be cut vertically (horizontally). Then, $\mathcal{M}_p$ is divided
into $\mathcal{M}_{p'}$ and $\mathcal{M}_{p''}$ such that the sums of the module areas of both subsets are
approximately the same. The rectangular area of region p is dissected
correspondingly.
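
The partitioning step described above can be sketched as follows. The tuple layout `(name, x, y, area)` and the function name are hypothetical, and choosing the split point that minimizes the area imbalance is one plausible reading of "approximately the same".

```python
def bipartition(modules, cut="vertical"):
    """Split a region's modules into two subsets of roughly equal area.

    modules: list of (name, x, y, area) with at least two entries.
    A vertical cut sorts by x, a horizontal cut sorts by y.
    """
    axis = 1 if cut == "vertical" else 2
    ordered = sorted(modules, key=lambda m: m[axis])
    total = sum(m[3] for m in ordered)
    running, best_k, best_gap = 0.0, 1, float("inf")
    for k in range(1, len(ordered)):          # keep both subsets non-empty
        running += ordered[k - 1][3]
        gap = abs(total - 2 * running)        # |area(left) - area(right)|
        if gap < best_gap:
            best_k, best_gap = k, gap
    return ordered[:best_k], ordered[best_k:]


# Coordinates as they might come from a global optimization phase.
modules = [("a", 5.0, 0.0, 1.0), ("b", 1.0, 0.0, 1.0),
           ("c", 3.0, 0.0, 1.0), ("d", 9.0, 0.0, 3.0)]
left, right = bipartition(modules)
print([m[0] for m in left], [m[0] for m in right])
```

Because the modules are only sorted and split, the relative order produced by the global optimization is preserved in the two subregions.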

For the column-oriented layout style of standard-cells, the chip area is first dis- 
sected by vertical cuts until all modules are assigned to columns. Then, the 
placement is refined within the columns by making horizontal cuts. 

For the general-cell layout style with rectangular modules of different widths
and heights, horizontal and vertical cuts alternate from one level to the next,
until $|\mathcal{M}_p| \le k$, where $k \ge 2$ is a predefined constant. For these regions an
exhaustive slicing optimization is performed.

5. Exhaustive Slicing Optimization 

The area utilization of slicing structures can be optimized efficiently using
the concept of shape functions [14, 10, 18].

For general-cell circuits, an exhaustive slicing optimization procedure (ESO) is 
proposed, combining enumeration of slicing subtrees with global optimization 
and evaluation of shape functions. 

For small subsets of modules, it is not too expensive to enumerate all possible
subtrees. For $k = |\mathcal{M}_p| = 2$ modules, there are $t(2) = 2$ subtrees; $t(3) = 6$,
$t(4) = 22$, . . . However, considering the possible permutations of modules for
each of the $t(k)$ subtrees, $t(k) \cdot k!$ placements have to be evaluated for each subset
$\mathcal{M}_p$. Therefore, k = 4 represents the practicable limit in most implementations
of this approach.
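
The growth of these counts can be tabulated with a short sketch. It rests on an assumption that goes beyond the text: the quoted values 2, 6, 22 coincide with the large Schröder numbers, so the sketch takes t(k) to be that sequence and extends it with the standard recurrence. The function name is hypothetical.

```python
from math import factorial


def slicing_subtrees(k):
    """t(k), assuming it equals the large Schroeder number S(k - 1).

    Recurrence: S(0) = 1, S(n) = S(n-1) + sum_{i<n} S(i) * S(n-1-i).
    """
    s = [1]
    for n in range(1, k):
        s.append(s[n - 1] + sum(s[i] * s[n - 1 - i] for i in range(n)))
    return s[k - 1]


for k in range(2, 8):
    t = slicing_subtrees(k)
    print(k, t, t * factorial(k))   # k, t(k), t(k) * k!
```

Under this assumption, dropping the factor k! reduces the work for k = 7 from 1806 * 5040 evaluations to 1806, which is consistent with the claim that larger subsets become tractable once the module order is fixed.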

By exploiting the result of the global optimization of section 3 the factor k! of
computation time can be saved. This allows us to apply ESO to larger subsets
of modules (approximately 7). The allocation of modules to regions is derived
from the coordinates of the modules determined in the last global optimization
phase. This results in short wire length, since it preserves the relative positions
of the global placement.

Combining ESO with the evaluation of shape functions means calculating
the combined shape function of all $t(|\mathcal{M}_p|)$ subtrees for each region with
$m_p = |\mathcal{M}_p| \le k$ modules and recursively that of the root. Thus, all possible shapes of
all exhaustively optimized regions simultaneously contribute to the final result.
This allows the designer to choose from a menu the best shape of the whole chip
with the shortest estimated wire length.

6. Complexity of the Method 

Space complexity: The connectivity matrix $\mathbf{C}$ for both the x- and y-coordinates
is stored in a list structure with $O(m + N + P)$ memory space, where m, N,
P are the numbers of movable modules, nets and pins, resp. For larger circuits,
P and N grow proportionally to m. The constraint matrix $\mathbf{A}$ can be stored in a
vector of length m (cf. (5)). The slicing tree has 2m - 1 nodes. Thus, the space
complexity of the CQOP and the partitioning phases is $O(m)$.

Time complexity: Each iteration step in the global optimization takes time
proportional to (N + m + P), which is $O(m)$. The number of iterations to solve the
CQOP can be limited by a constant much smaller than m. The partitioning of q
regions takes time proportional to $\sum_p |\mathcal{M}_p| \log |\mathcal{M}_p|$, which is $O(m \log m)$. A balanced
slicing tree has $\log m$ levels. Thus, the total time complexity of global
optimization and partitioning is $O(m \log^2 m)$.

The space and time complexity of ESO depends on the number of different mod- 
ule shapes and therefore varies from one circuit to the other. 

7. Experimental Results 

In this section, we present experimental placement results in terms of
estimated wire length, chip area and computation time. The cpu times were
obtained on a DEC MicroVAX II computer running ULTRIX-32. GORDIAN is
written in C.

Characteristics of benchmark circuits from [15] are listed in table 1. In tables 2
and 3, three methods are compared for the standard-cell circuits Primary1 and
Primary2: (a) min-cut with terminal propagation [6, 4], where the best result
obtained from several randomly created starting solutions is shown, (b) RT:
the relative placement / transportation method from [7], and (c) Gordian.
The modules were placed in 17 and 26 columns, respectively, with estimated
inter-column channel widths of 220 µm and 270 µm. Wire length is measured
as half perimeters and minimum spanning trees with squared Euclidean metric
(E²-MST), summed over all nets. The E²-MST measure is highly correlated
circuit     layout style     modules    nets     pins

Primary1    standard-cell        752     904     2941
Primary2    standard-cell       2907    3029    11226
AMI33       general-cell          33     123      522


Table 1. Characteristics of the benchmarks.



algorithm    half perimeter [m]    E²-MST (× 10’)

min-cut      1.739                 2.263
RT           2.177                 3.626
Gordian      1.503                 1.706


Table 2. Standard-cell benchmark Primary1.



algorithm    half perimeter [m]    E²-MST (× 10’)

min-cut      9.823                 2.104
RT           8.685                 1.739
Gordian      8.142                 1.502


Table 3. Standard-cell benchmark Primary2.



ESO parameter k    wire length [mm]    chip area [mm²]    cpu time [s]

1                  76.37               1.787               38.0
2                  73.40               1.594               43.6
3                  72.88               1.499               40.6
4                  72.40               1.475               53.7
4a                 82.22               1.449              104.1
5                  72.40               1.475               53.7
6                  1626                1.449               57.9
7                  76.26               1.449               58.1


Table 4. General-cell benchmark AMI33.



The general-cell circuit AMI33 was placed with varying values of the ESO
parameter k (cf. section 5) without extra wiring space. The results with
minimal area from the Gordian menu are presented in table 4 (for k = 5, see also
figure 1). The table shows the effect of exhaustive slicing on both chip area and
wire length scored as half perimeters, summed over all nets. The values of entry
4a were obtained by permuting the modules for subsets with $|\mathcal{M}_p| \le 4$ rather
than exploiting the global optimization result. The ESO procedure of section 5
is clearly superior when all factors are considered.
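
The half-perimeter measure used for the wire lengths above can be sketched in a few lines. The sketch assumes each net is given as a list of pin coordinates and scores it as the half perimeter of the pins' bounding box; the function name and the sample nets are hypothetical.

```python
def half_perimeter_wire_length(nets):
    """Sum over all nets of the half perimeter of the pin bounding box."""
    total = 0.0
    for pins in nets:
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


nets = [
    [(0, 0), (3, 4)],            # bounding box 3 x 4
    [(1, 1), (2, 1), (1, 5)],    # bounding box 1 x 4
]
print(half_perimeter_wire_length(nets))
```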

Figure 2 shows the computation time of Gordian versus circuit size for several 
circuits. Evidently, the experiments conform to the expected time behaviour, al- 
lowing the solution of much larger placement problems even on a workstation. 

8. Conclusions 

An important aim of current research on layout techniques is the development 
of placement strategies that emphasize a global view, since this is difficult to ob- 
tain for a human designer particularly when designing large chips. 

Gordian, a new placement method preserving the simultaneous treatment of
all modules (cells) through all phases of the algorithm, has been presented. For
general-cell circuits, it provides the designer with a menu of near-optimal
placements in terms of the possibly conflicting parameters chip area, aspect ratio and
estimated wire length, from which he can make his choice. Excellent results
have been obtained with respect to placement quality and computation time.
First experiments with sea-of-gates circuits have shown that Gordian is well 
suited for this layout style, too. 



[Plot: cpu seconds (0 to 2000) versus number of modules (0 to 3000).]

Figure 2. GORDIAN: computation time vs. circuit size.
Abstract

An exact zero skew clock routing algorithm using the Elmore delay model is presented. Recursively,
in a bottom-up fashion, two zero-skewed subtrees are merged into a new tree with zero skew.
The algorithm can be applied to single-staged clock trees, multi-staged clock trees, and
multi-chip system clock trees. It is ideal for hierarchical methods of constructing large systems. All
subsystems can be constructed in parallel and independently, then interconnected with the exact zero
skew method.



1. Introduction 

We propose in this paper an exact zero skew clock routing algorithm for
timing performance optimization of synchronous digital systems. Clock skew is
defined as the maximum difference of the delays from the clock source to the
clock pins on latches. Optimization of the clock skew can dramatically reduce
a system’s cycle time, and hence improve its timing performance. In contrast, improper
clock skew may sometimes cause clock hazard and system malfunction [2]. The
following equation summarizes the relationship of the clock period P, clock
skew s, worst-case data path delay $d_{max}$ and other offset constant $P_o$, for the
condition of proper timing.

$P = s + d_{max} + P_o$

Note that $P_o$ is a constant that includes data set up time, latch active time, and
other possible offset factors such as timing safety margins. The latch active time
is the lag time for the data to be latched in after the latch is triggered by a clock
signal.

It is clear from the equation that in order to reduce the cycle time P, it is
necessary to minimize the skew s, in addition to minimizing the worst-case
data delay $d_{max}$ in the combinational logic. As interconnection delay becomes
more dominant and design sizes grow, the clock skew is
also becoming more significant in terms of performance optimization.

Many heuristics have been proposed in the past for clock routing. H-tree
structures are most widely used, especially in systolic array designs. A
generalization of the H-tree that hierarchically divides at the median and connects the mean
points is proposed in [3]. A further improvement is made by bottom-up pairwise
connections, which construct a perfectly length-balanced tree [4]. However,
all these heuristics focus only on wire length balancing, rather than on the real
objective of balancing clock delay. In contrast, what we propose is an exact
algorithm that balances the clock delays and takes into account uneven loading
and buffering effects.

The outline of this paper is as follows. We first study how to compute signal 
delays efficiently on an RC tree. An RC tree is a connected, acyclic, undirected 
graph with each branch associated with a resistance value and each node asso- 
ciated with a capacitance value. 

Next, we discuss how a clock tree is modeled as an RC tree for delay analyses.
In general, clock trees are classified into two types. The first type is a
single-staged clock tree, in which all clock pins are driven directly from a clock source.
In order to reduce phase delays (the maximum delay from the clock source to
a clock pin) and to supply sufficient driving currents, usually several levels of
buffers are added to create a multi-staged clock tree. Thus the second type is
the multi-staged clock tree, in which the clock pins are driven from intermediate
buffers, and the buffers are driven by either other buffers or the clock source.
A multi-chip system clock tree is basically a multi-staged clock tree except that
the clock pins are scattered on many chips (or cards).

Then the zero skew algorithm is presented. Based on a lumped delay model 
and the delay computation method, we found that any two zero-skewed subtrees 
can be merged into a tree with zero skew by tapping the connection to a specific 
location of each subtree. Basically, it is a recursive bottom-up algorithm. 

Finally, we present experimental results of the zero skew algorithm, and
comparisons with a wire length balancing heuristic [4].

2. Linear Time Hierarchical Delay Computation 

We adopt the commonly used Elmore delay model [5] to calculate the signal
traveling time from a clock source to each clock pin. We modify the hierarchical
method proposed in [5] to obtain a hierarchical method for computing delays in
a bottom-up fashion, which is the key to our zero skew algorithm.

We first define some terms. Let T represent an RC tree in which every node has
an index. We assume the index of the root is always 0. A predecessor of node i
is a node that lies on the unique path between the root and node i, excluding
node i itself. An immediate predecessor of node i is a predecessor of node i
with no other nodes between them. Similarly, the successors of node i are the
nodes that have node i as one of their predecessors. An immediate successor
of node i is a successor of node i with no other nodes in between. The root is
the node with no predecessor. Leaf nodes are nodes with no successors. A
subtree $T_i$ is the subtree of T formed by node i and its successors. Since T
is a tree, there is exactly one edge between a node and its immediate
predecessor, so we simply define branch i as the edge between node i and its
immediate predecessor.

Let $c_i$ be the node capacitance of node i and $r_i$ be the resistance of
branch i. To simplify discussion, we set $r_i = 0$ for the root node i. We
define $IS(i)$ as the set of all immediate successors of node i. Then the total
subtree capacitance $C_i$ of $T_i$ is defined recursively as

$$C_i = c_i + \sum_{k \in IS(i)} C_k \qquad (1)$$

The above equation suggests that the subtree capacitance can be computed 
in a depth first search manner. The capacitance of the subtree rooted from a 
node can be computed from its own node capacitance and the summation of the 
subtree capacitance of its immediate successors. Hence a recursive bottom-up 
algorithm can be used to compute the subtree capacitance of each node. 

To calculate the delay, we first define N as the collection of all nodes on
the tree and $N(i,j)$ as the collection of nodes on the path between node i and
node j, excluding node i but including node j. The delay to a leaf node i can
be calculated by the following formula

$$t_{0i} = \sum_{n \in N(0,i)} r_n C_n$$

As a generalization, we can compute the "delay time" between any two nodes
i and j by the following formula, assuming i is a predecessor of j.

$$t_{ij} = \sum_{n \in N(i,j)} r_n C_n$$

It can be shown easily that if i is an intermediate node between node k and
node j, then

$$t_{kj} = t_{ki} + t_{ij} \qquad (2)$$

In the case that node k is the root (i.e. k = 0) and node i is the immediate
predecessor of node j, we have

$$t_{0j} = t_{0i} + r_j C_j \qquad (3)$$

since there is only one edge between nodes i and j, and $t_{ij} = r_j C_j$.
This equation suggests that we can easily calculate the delay from the root to
any leaf node in one depth first search. The delay time to each node can be
derived from the delay time to its immediate predecessor, the branch
resistance and the subtree capacitance. Recursively, in a top-down fashion, we
compute the delay time to each node.

Since the time complexity of the depth first search algorithm is linear in the
number of edges [1], and the number of edges of a tree is the number of nodes
minus one, we easily have the following theorem.

Theorem 1 The delay time from the root to each node on an RC tree can be 
computed in linear time. 
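
As an illustration of Theorem 1, the following Python sketch computes the subtree capacitances of Eq. (1) in a bottom-up pass and the delays of Eq. (3) in a top-down pass, touching each node a constant number of times. The data layout (a `children` dictionary plus dictionaries `r` and `c` keyed by node index) and the name `elmore_delays` are assumptions made for this example; they are not taken from the paper.

```python
def elmore_delays(children, r, c, root=0):
    """Return (C, t): subtree capacitances (Eq. 1) and root-to-node delays (Eq. 3).

    children[i] lists the immediate successors of node i (hypothetical layout);
    r[i] is the resistance of branch i and c[i] the node capacitance of node i.
    """
    # Preorder traversal: every node appears after its immediate predecessor.
    order, stack = [], [root]
    while stack:
        i = stack.pop()
        order.append(i)
        stack.extend(children.get(i, []))

    # Bottom-up pass (reverse preorder): C_i = c_i + sum of C_k over IS(i).
    C = {}
    for i in reversed(order):
        C[i] = c[i] + sum(C[k] for k in children.get(i, []))

    # Top-down pass: t_0j = t_0i + r_j * C_j for each immediate successor j of i.
    t = {root: 0.0}
    for i in order:
        for j in children.get(i, []):
            t[j] = t[i] + r[j] * C[j]
    return C, t
```

Both passes iterate over the same traversal order, so the total work is proportional to the number of nodes, which is the linear bound stated in the theorem.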
Generalization to Buffered RC Trees

To handle multi-staged clock trees (or buffered clock trees), we generalize
the previous delay computation method to a buffered RC tree. Before we define
a buffered RC tree, we first discuss the equivalent circuit model of a clock
buffer, shown in Fig. 1b. We specifically designate the input node of a buffer
as a buffer input node, which is important for delay calculation. The box in
Fig. 1 represents a delay element with $d_b$ as the buffer internal delay; it
is connected to the buffer input node on one end and to the buffer
output-driving resistor $r_b$ on the other end. The buffer input capacitor
$c_b$ is on the buffer input node, and the buffer output-driving resistor
$r_b$ is connected between the delay element and the buffer output node. One
function of buffers is to supply enough current for driving latches. The other
function is to create stages, so that the subtree capacitance of the buffer
output node is not carried over, i.e. the equivalent total subtree capacitance
as seen at the buffer input node is only $c_b$. Usually, the buffer driving
resistance and input capacitance are designed to be small. This is why
buffering usually reduces delay time.

Figure 1. (a) A clock buffer, (b) An equivalent model. Legend: $d_b$: buffer
internal delay; $c_b$: buffer input capacitance; $r_b$: buffer output driving
resistance; A: buffer input node; B: buffer output node.

To account for the buffering effects, we define a buffered RC tree just like a
normal RC tree, except that each branch i is now also associated with a branch
delay $d_i$ in addition to the branch resistance $r_i$. The branch delay is
always zero unless the branch stands for a buffer. The basic delay calculation
presented previously is modified as follows for buffered RC trees.

The calculation of the equivalent subtree capacitance at node i now depends
on whether node i is a buffer input node. Eq. (1) has to be modified as
follows for computing the subtree capacitance of a buffered RC tree.

$$C_i = \begin{cases} c_i & \text{if node } i \text{ is a buffer input node} \\ c_i + \sum_{k \in IS(i)} C_k & \text{otherwise} \end{cases}$$

We also extend the delay computation for a node i and its successor j to
accommodate the new branch delays, i.e.

$$t_{ij} = \sum_{n \in N(i,j)} (r_n C_n + d_n)$$

Thus Eq. (3) is modified to be

$$t_{0j} = t_{0i} + r_j C_j + d_j$$

for delay calculation of a buffered RC tree.
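
The buffered variant can be sketched in the same style. The example below is a minimal illustration under assumed names: `buffer_inputs` is a hypothetical set holding the indices of buffer input nodes, and `d` holds the branch delays (zero except on the branch that represents a buffer). The only changes relative to the unbuffered computation are that a buffer input node does not pass its downstream capacitance upward, and that each branch adds its delay.

```python
def buffered_delays(children, r, c, d, buffer_inputs, root=0):
    """Return (C, t) for a buffered RC tree (hypothetical data layout).

    d[i] is the branch delay of branch i; buffer_inputs is the set of node
    indices that are buffer input nodes.
    """
    order, stack = [], [root]
    while stack:
        i = stack.pop()
        order.append(i)
        stack.extend(children.get(i, []))

    C = {}
    for i in reversed(order):
        if i in buffer_inputs:
            # The buffer hides everything behind it: only c_i is visible.
            C[i] = c[i]
        else:
            C[i] = c[i] + sum(C[k] for k in children.get(i, []))

    # Modified Eq. (3): t_0j = t_0i + r_j * C_j + d_j.
    t = {root: 0.0}
    for i in order:
        for j in children.get(i, []):
            t[j] = t[i] + r[j] * C[j] + d[j]
    return C, t


# Chain: clock source (0) -> buffer input (1) -> buffer output (2) -> pin (3).
# Branch 2 stands for the buffer: it carries r_b = 0.5 and d_b = 3.0.
children = {0: [1], 1: [2], 2: [3]}
r = {0: 0.0, 1: 1.0, 2: 0.5, 3: 2.0}
c = {0: 1.0, 1: 2.0, 2: 0.0, 3: 4.0}
d = {0: 0.0, 1: 0.0, 2: 3.0, 3: 0.0}
C, t = buffered_delays(children, r, c, d, buffer_inputs={1})
```

In this toy chain the capacitance seen at node 1 is only its own 2.0, so the source drives 3.0 in total instead of 7.0, which is the staging effect described above.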

3. Delay Computations of Clock trees 

In this section we discuss how to model a clock tree as a buffered RC tree
so that we can perform delay computation efficiently. Each clock tree
realization consists of wiring segments, clock pins, and clock buffers. Hence
it is necessary to know the equivalent RC model of each component.

Equivalent π-model for a Distributed RC Line

Distributed RC lines characterize the circuit performance of wiring segments
more accurately. A distributed RC line is usually represented by the symbol
shown in Fig. 2a. Either a π-model (Fig. 2b) or a T-model (Fig. 2c) is used to
represent the equivalent circuit of the distributed RC line.

Throughout this paper, we will use the equivalent π-model for analysis. An
input node, an output node and a branch between the two nodes are used to
represent the equivalent π-model of a wire segment. Let R be the total wire
resistance and C the total wire capacitance. Then the equivalent input and
output node capacitances are both equal to C/2, and the equivalent branch
resistance is R.

Equivalent Buffered RC Tree of a Clock Tree 

We use the generic example shown in Fig. 3a to illustrate how to construct an
equivalent buffered RC tree from a multi-staged clock tree. For this
particular example, we assume a clock source is driving a buffer through wire
1, and the buffer is connected to the clock pin on a latch through wire 2. The
driving resistance of the clock source is assumed to be $r_s$. Both wire
segments 1 and 2 are represented by equivalent π-models as discussed earlier.
The buffer is transformed to an equivalent circuit with buffer input
capacitance $c_b$, buffer delay $d_b$ and buffer output driving resistance
$r_b$. The end clock pin of the latch is associated with a loading capacitance
$c_l$. The equivalent buffered RC tree is shown in Fig. 3b.

Figure 2. (a) A distributed RC line, (b) The equivalent π-model. (c) The equivalent T-model.

Figure 3. (a) A generic multi-staged clock tree, (b) The equivalent buffered RC tree.

Lumped Delay Model 

To make the presentation of the zero skew algorithm easier, we introduce a
lumped delay model of a subtree. Recall Eq. (2), $t_{kj} = t_{ki} + t_{ij}$.
Suppose i is an immediate successor of k, and j is a leaf node. Then

$$t_{kj} = d_i + r_i C_i + t_{ij} \qquad (4)$$

Consider node i as the root of the subtree $T_i$. To compute the delay time
one level up, from node k to node j, we need only know the branch resistance
$r_i$, the branch delay $d_i$, the subtree capacitance $C_i$ and the delay time
$t_{ij}$ from the root of $T_i$ to the node j, according to Eq. (4).

Thus we propose an equivalent lumped delay model of the subtree $T_i$ for
simplifying the delay computation. In the equivalent circuit, the subtree
$T_i$ is replaced by an input capacitance $C_i$ and a branch delay $t_{ij}$
from input node i to leaf node j. We will use this lumped delay model for
developing the algorithm in the next section.



4. Zero Skew Algorithm 

The zero skew algorithm is a recursive bottom-up process. We describe only
one recursive step. Repeating the step in a bottom-up fashion constructs a
complete zero skew clock tree.




Figure 4. Zero-skew-merge of two subtrees. (Wire labels in the figure: x and 1-x.)



We assume every subtree has achieved zero skew, which means the signal
delays from the root of the subtree to its leaf nodes are equal. This holds
trivially if the subtree contains only one leaf node, which serves as the
starting point of the algorithm.

To interconnect two zero-skewed subtrees with a wire and ensure zero skew
of the merged tree, we must decide where on the wire to place the new root of
the merged tree so that the delay times from this new root to all leaf nodes
are equal, i.e. zero skew. We call this new root point on the wire a tapping
point, and this process the zero-skew-merge process.

Let us discuss the example shown in Fig. 4 with subtree 1 and subtree 2.
First, assume the lumped delay model of each subtree is as shown in Fig. 4.
The tapping point separates the interconnection wire of the two subtrees into
two segments (which may not be equal). Each wire segment is represented by a
π-model as shown. To ensure that the delays from the tapping point to the leaf
nodes of both subtrees are equal, we require that

$$r_1 (c_1/2 + C_1) + t_1 = r_2 (c_2/2 + C_2) + t_2 \qquad (5)$$

according to Eq. (4). Note that $r_1$ and $c_1$ are the total resistance and
capacitance of wire segment 1, and $r_2$ and $c_2$ are those of wire segment 2.
There are no branch delays.

We assume that the total wire length of this interconnection wire segment is
l. The wire length from the tapping point to the root of subtree 1 is
$x \times l$ and hence the wire length from the tapping point to the root of
subtree 2 is $(1-x) \times l$.

Let α be the resistance per unit length of wire and β be the capacitance per
unit length of wire. Then we have $r = \alpha l$, $r_1 = \alpha x l$,
$r_2 = \alpha (1-x) l$. Also, $c = \beta l$, $c_1 = \beta x l$,
$c_2 = \beta (1-x) l$.

Hence, after solving Eq. (5), we find that the zero skew condition requires
that

$$x = \frac{(t_2 - t_1) + \alpha l (C_2 + \beta l / 2)}{\alpha l (\beta l + C_1 + C_2)}$$
If 0 ≤ x ≤ 1, the tapping point is somewhere along the segment
interconnecting the two subtrees and is legal. If x < 0 or x > 1, the two
subtrees are too far out of balance and the interconnection wire has to be
elongated. For simplicity, we discuss only the case x < 0. In this case, the
tapping point has to be exactly on the root of subtree 1 in order to minimize
total interconnection length. Assume the elongated wire length is l'. The
distributed resistance is αl' and the distributed capacitance is βl'. The
minimum elongated wire length l' must satisfy

$$t_1 = t_2 + \alpha l' (C_2 + \beta l' / 2)$$

or

$$l' = \frac{\sqrt{(\alpha C_2)^2 + 2 \alpha \beta (t_1 - t_2)} - \alpha C_2}{\alpha \beta}$$

Similar results can be obtained for the case x > 1. It is worth noting that
the uneven loading effect is naturally taken care of by this approach.

Wire "snaking" is a common practice for wire elongation. Since it is the
nature of a clock-wiring algorithm to balance two subtrees, snaking should
not occur often.

If two subtrees are out of balance and the elongation severely affects
wirability, then the addition of buffers, delay lines, or capacitive
terminators should be considered, based on the same balancing principle. For
instance, the capacitance $C_s$ of a capacitive terminator to be attached to
the root of subtree 2 for the case x < 0 can be determined by solving the
equation $t_1 = t_2 + \alpha l (C_2 + \beta l/2 + C_s)$, which gives
$C_s = (t_1 - t_2)/(\alpha l) - (C_2 + \beta l/2)$.

Before presenting the algorithm formally, we define a few more related terms. 
The number of stages of a clock tree is defined as the maximum number of 
clock buffers on a path from the clock source to a clock pin, with the clock 
source counted as a buffer. A cluster is the collection of a clock buffer and 
its associated clock pins. Each cluster is tagged with a stage number, which 
is exactly the number of buffers on the path between the clock source and the 
clock buffer of the cluster. The number includes the clock source and clock 
buffers of the cluster. In summary, we have the following linear time zero skew 
clock routing algorithm. 

Algorithm 4.1 (Zero Skew Algorithm)

S1: Let s = number of clock tree stages.

S2: If s = 0, report results and exit; otherwise, continue.

S3: For each cluster in stage s, do

S3.1: Treat each clock pin in the cluster as a tapping point. Repeat steps
S3.2 and S3.3 until there is only one tapping point left.

S3.2: Pair up tapping points.

S3.3: For each pair, perform zero-skew-merge of the two subtrees and
determine the new tapping point, using the method discussed in this
section. If only one point is left in the group, do nothing.

S3.4: Connect the last tapping point directly to the clock buffer output
node.

S4: Let s = s - 1. Continue from S2.

The zero skew algorithm does not depend on the algorithm used for grouping
the clock pins or tapping points into pairs; it works with any pairing
algorithm. However, to optimize wirability, a minimum weighted matching
algorithm may be better. Alternatively, a more efficient algorithm that
alternately partitions the clock pins into two equal-sized groups can be used.

For implementation in a real environment, we have to consider blockages and
the difference in electrical constants on different layers. The connection
between any two tapping points can be made by any existing wiring algorithm
that handles wiring blockages. The tapping point is then found by searching
through each wiring segment, taking into account the electrical constants of
its layer.

To minimize the total wire length, we may construct a few possible wiring
patterns (e.g. the two one-bend connections) between each pair of tapping
points, and pick the one that gives the shorter length at the next higher
level of the pairing process.



Figure 5. A zero skew wiring result of a simple example. (Labels in the
figure: pins A = (8,0), c = 16F; B = (22,6), c = 10F; C = (0,10), c = 1F;
D = (5,15), c = 2F; tapping points E, G = (10,6) and F = (5,11);
α = 0.1Ω/unit, β = 0.2F/unit; the wire between E and F is snaked.)



Example: An example with four clock pins (Fig. 5) is used to illustrate the
algorithm. Pin A is at (8,0) with 16F loading capacitance. Pin B is at (22,6)
with 10F capacitance. Pin C is at (0,10) with 1F. Pin D is at (5,15) with 2F.
The resistance per unit length is 0.1Ω, and the capacitance per unit length is
0.2F. Pins A and B form one pair and pins C and D the other. According to the
algorithm, a tapping point E is placed at (10,6) so that the delays to A and B
are both equal to 13.44ns. Similarly, a tapping point F is located at (5,11)
for the connection of pins C and D, with equal delay 0.96ns. The two subtrees
rooted at E and F are very unbalanced: we find that x = -0.175 < 0. The wire
connecting E and F has to be elongated by 8.28 units, and the tapping point G
has to coincide with E. The final wiring result is shown in Fig. 5. Note that
the connections for (A, B) and (C, D) are chosen from the two one-bend
connections of each pair so as to shorten the wire between E and F.
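
The numbers in this example can be reproduced with a few lines of Python. The sketch below works purely on lumped models (subtree capacitance and delay) and Manhattan wire lengths; the function name `zero_skew_merge` and its return layout are assumptions made for illustration, and the branch for x > 1 is written by symmetry with the x < 0 case, which the text only describes as "similar".

```python
import math


def zero_skew_merge(C1, t1, C2, t2, length, alpha, beta):
    """Merge two zero-skew subtrees given by their lumped models (C, t).

    Returns (x, wire, C, t): the tapping position x (fraction of the wire
    measured from subtree 1), the wire length actually used, and the lumped
    model (capacitance, delay) of the merged tree. Assumes length > 0.
    """
    x = ((t2 - t1) + alpha * length * (C2 + beta * length / 2)) / (
        alpha * length * (beta * length + C1 + C2)
    )
    if 0 <= x <= 1:
        wire = length
        seg = x * length  # wire from the tapping point to subtree 1
        t = alpha * seg * (beta * seg / 2 + C1) + t1
    elif x < 0:
        # Tap at the root of subtree 1 and elongate the wire toward subtree 2.
        wire = (math.sqrt((alpha * C2) ** 2 + 2 * alpha * beta * (t1 - t2))
                - alpha * C2) / (alpha * beta)
        t = t1
    else:
        # Symmetric case x > 1 (assumed): tap at the root of subtree 2.
        wire = (math.sqrt((alpha * C1) ** 2 + 2 * alpha * beta * (t2 - t1))
                - alpha * C1) / (alpha * beta)
        t = t2
    return x, wire, C1 + C2 + beta * wire, t


ALPHA, BETA = 0.1, 0.2
# A leaf is a lumped model with its pin capacitance and zero delay.
x_ab, _, C_e, t_e = zero_skew_merge(16, 0, 10, 0, 20, ALPHA, BETA)  # A-B, length 20
x_cd, _, C_f, t_f = zero_skew_merge(1, 0, 2, 0, 10, ALPHA, BETA)    # C-D, length 10
x_ef, wire_ef, C_g, t_g = zero_skew_merge(C_e, t_e, C_f, t_f, 10, ALPHA, BETA)
print(round(t_e, 2), round(t_f, 2), round(x_ef, 3), round(wire_ef - 10, 2))
```

With subtree 1 rooted at E and subtree 2 rooted at F, the last merge yields a negative x, so the tapping point stays on E and the E-F wire grows from 10 units to about 18.28 units, matching the 8.28 units of elongation quoted above.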

5. Experimental Results 

We test our algorithm on five examples of different sizes. The statistics of
the examples are shown in Table 1. The chip width and height are both in units
of 1/10 micron. We assume the resistance per unit length is 3mΩ and the
capacitance per unit length is 0.02fF. The loading capacitances of the clock
pins range from 30fF to 80fF. For simplicity, we assume all are one-stage
clock trees, i.e. there are no intermediate clock buffers. All experiments are
conducted on an IBM 3090 machine.



Examples        r1       r2       r3       r4        r5
No. Pins        267      598      862      1903      3101
Chip width      69984    94016    97000    126970    142920
Chip height     70000    93134    98500    126988    145224

Table 1. The Statistics of Testing Examples.



                      Zero Skew                      Length Balancing
Examples   Phase delay   Skew    Runtime      Phase delay   Skew
           (ns)          (ns)    (s)          (ns)          (ns)
r1         1.799         0       0.1          1.798         0.132
r2         4.631         0       0.3          5.367         0.806
r3         7.055         0       0.5          7.655         0.702
r4         20.666        0       1.2          23.316        3.558
r5         35.918        0       2.0          38.958        1.931

Table 2. A Comparison Between the Zero Skew Algorithm and a Wire Length Balancing
Heuristic.

We use a simple heuristic for pairing up clock pins in this experiment. We
recursively partition the pins into two equal (or almost equal) halves at the
median of the sorted pin list, alternating between horizontal and vertical
directions. This heuristic creates a binary tree for each example. Then the
pins are connected based on the zero skew algorithm. For comparison, we also
implement a wire length balancing heuristic [4] on the same binary tree. The
results are shown in Table 2. The zero skew algorithm removes the skew that
length balancing leaves behind (0.132ns to 3.558ns in these examples), which
is especially important for large chips.
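
The pairing heuristic can be sketched as a short recursion. The version below is one possible reading of the description: it sorts along the x coordinate first and alternates axes at each level, and it represents the resulting binary tree as nested tuples of pin names. The function name `median_partition` and the choice of the starting axis are assumptions.

```python
def median_partition(names, coords, axis=0):
    """Recursively split pins at the median, alternating the sort axis.

    names is a non-empty list of pin names, coords maps a name to (x, y).
    Returns a pin name for a single pin, or a pair (left, right) of subtrees.
    """
    if len(names) == 1:
        return names[0]
    ordered = sorted(names, key=lambda name: coords[name][axis])
    mid = len(ordered) // 2
    next_axis = 1 - axis  # switch between x (0) and y (1)
    return (median_partition(ordered[:mid], coords, next_axis),
            median_partition(ordered[mid:], coords, next_axis))


# The four pins of the example in Section 4.
coords = {"A": (8, 0), "B": (22, 6), "C": (0, 10), "D": (5, 15)}
tree = median_partition(list(coords), coords)
```

On the four pins of the earlier example this produces the pairs (C, D) and (A, B), the same grouping that was used there.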

6. Conclusions 

We have presented a novel zero skew clock routing algorithm based on Elmore
delay calculation. The approach is ideal for constructing large systems. It
can be modified for customized skew (for cycle stealing) and multi-phase clock
systems [6]. We expect this clock routing algorithm to be widely used for
performance enhancement of synchronous VLSI digital systems.

Abstract

We consider the problem of bipartitioning a circuit into two balanced components so as to
minimize the number of crossing nets. Previously, Kernighan and Lin type (K&L) heuristics, the
simulated annealing approach, and analytical methods were given to solve the problem. However,
network flow techniques were overlooked as viable approaches to min-cut balanced bipartition due
to their high complexity. In this paper we propose a balanced bipartition heuristic based on
repeated max-flow min-cut techniques, and give an efficient implementation that has the same
asymptotic time complexity as that of one max-flow computation. We implemented our heuristic
algorithm in a package called FBB. The experimental results demonstrate that FBB outperforms
K&L heuristics and analytical methods in terms of the number of crossing nets, and our efficient
implementation makes it possible to partition large circuit netlists with reasonable runtime. For
example, the average elapsed time for bipartitioning a circuit S35932 of almost 20K gates is less
than 20 minutes on a SPARC10 with 32MB memory.



1. Introduction 

Circuit partitioning is a fundamental problem in many areas of VLSI layout
and design, such as floorplanning, placement and multi-chip, multi-FPGA
partitioning. Min-cut balanced bipartition is the problem of partitioning a
circuit into two disjoint components with equal weights such that the number
of nets connecting the two components is minimized. The min-cut balanced
bipartition problem was shown to be NP-complete [11]. Because of its
importance, many heuristic algorithms have been devised for its solution.
Among the well-known heuristics are the following [6]: Kernighan and Lin type
(K&L) iterative improvement methods [20], [9], simulated annealing approaches
[19], and analytical methods for the ratio-cut objective [25] (see e.g. [15],
[4], [23]).

The well-known network max-flow min-cut theorem [8], [22], [7], [10], [16]
is an important combinatorial optimization technique. It has many applications
in VLSI design such as linear placement [3], min-cut replication [13], [14],
and FPGA technology mapping [2], [27]. The network max-flow min-cut technique
is in fact the most natural method for finding a min-cut in a graph. However,
it was overlooked as a viable approach for circuit partitioning due to the
following reasons: (1) The two components obtained by the network max-flow
min-cut technique are not necessarily balanced. (2) Although a balanced cut
can be achieved by repeatedly applying min-cut to the larger component, this
method can possibly incur n max-flow computations, where n is the size of the
flow network, making it impractical for large problem instances. (3) The
traditional network flow technique works on graphs, but hypergraphs are more
accurate models for circuit netlists than graphs.

Authors are currently with ¹Intel Corp. and ²Department of Electrical and Computer Engineering,
University of Illinois at Urbana-Champaign.

In this paper we explore solutions to the above problems faced by the tradi- 
tional network flow technique. We first propose a method for exactly modeling 
a netlist (or equivalently, a hypergraph) by a flow network, and a balanced bi- 
partition heuristic based on a repeated max-flow min-cut technique. We then 
give an efficient implementation that has the same asymptotic time complexity 
as that of one max-flow computation. 

We use a generalized notion of the balanced bipartition, the r-balanced
bipartition (also used in [9]), which is a bipartition such that one component
has weight rW, a fraction r of the total weight W. In the special case
r = 1/2, an r-balanced bipartition is a balanced bipartition. Since in
practice there is little reason to strictly enforce the r-balanced criterion,
we introduce a deviation factor ε that allows the component weight to range
from (1 - ε)rW to (1 + ε)rW. We show in Theorem 3.2 that both the runtime and
the cut size produced by our algorithm are decreasing functions of ε. This
kind of direct relationship was not shown in previous partitioning heuristics.

The rest of this paper is organized as follows. In Section 2, we first present a 
method for exactly modeling a netlist by a flow network, and an optimal algo- 
rithm for finding a min-net-cut bipartition (not necessarily balanced) of a circuit 
with respect to a source and a sink. This algorithm serves as a basic proce- 
dure for our min-cut balanced bipartition heuristic. We also show that the most 
r-balanced min-cut bipartition problem is NP-complete. We then present our 
heuristic algorithm for finding a min-net-cut r-balanced bipartition based on the 
repeated network flow technique in Section 3 with an efficient implementation. 
We compare our balanced bipartition results with those of K&L heuristics and 
analytical methods in Section 4, and conclude the paper in Section 5. 

2. Optimal Min-Net-Cut Bipartition 

We first give some definitions of a flow network. A flow network G = (V,E)
is a directed graph in which each edge $e \in E$ has a capacity c(e) > 0. Two
nodes s and t in V are specified: s is called the source, t is called the
sink. Figure 1(a) shows an example of a flow network and a max-flow. The label
x/y on an edge indicates that the flow and the capacity on the edge are x and
y respectively. The dark edges are the forward edges of the corresponding
min-cut $(X, \bar{X})$. An s-t flow (or flow for short) in G is a real-valued
function $f: E \to \mathbb{R}$ such that (1) for all $e \in E$,
$0 \le f(e) \le c(e)$, and (2) for all $u \in V \setminus \{s, t\}$, the sum
of the incoming flow into u is equal to the sum of the outgoing flow from u.
An edge e in E is saturated if f(e) = c(e). The value |f| of a flow f is
defined as the sum of the flow outgoing from s, which is equal to the sum of
the flow incoming to t. A maximum-flow (or max-flow for short) in G is a flow
of maximum value from s to t.

An s-t cut (or cut for short) $(X, \bar{X})$ of a flow network G = (V, E) is a
bipartition of V into X and $\bar{X}$ such that $s \in X$ and
$t \in \bar{X}$. An edge whose starting node is in X and ending node is in
$\bar{X}$ is called a forward edge. An edge whose ending node is in X and
starting node is in $\bar{X}$ is called a backward edge. The capacity of the
cut $(X, \bar{X})$, denoted by $cap(X, \bar{X})$, is the sum of the capacities
of the forward edges only, i.e. the edges from X to $\bar{X}$. An augmenting
path from u to v in G is a simple path from u to v in the undirected graph
resulting from the network by ignoring edge directions, that can be used to
push additional flow from u to v.

Theorem 2.1 Max-flow min-cut theorem [8]

Given a max-flow f in G, let $X = \{v \in V : \exists$ an augmenting path from
s to v in G$\}$, and let $\bar{X} = V \setminus X$. Then $(X, \bar{X})$ is a
cut of minimum capacity (which is equal to |f|), and f saturates all forward
edges from X to $\bar{X}$.



Figure 1. (a) A flow network G, and a max-flow f in G. (b) A digraph N representing a
sequential circuit and its net-cuts.

2.1 Modeling a Net in a Flow Network 

We represent a sequential circuit netlist as a digraph N = (V,E) where V is
a set of nodes representing combinational gates and registers, and E is a set
of edges representing interconnections between gates and registers. Each node
v in V has an associated weight $w(v) \in \mathbb{R}^+$. The total weight of a
subset $U \subseteq V$ is denoted by w(U). Let W = w(V) denote the total
weight of the circuit N.

A net $n = (v; v_1, \ldots, v_l)$ is a set of outgoing edges from node v in N.
For example in Figure 1(b), net a consists of two edges $(r_1, g_1)$ and
$(r_1, g_2)$. Given two nodes s and t in N, an s-t cut (or cut for short)
$(X, \bar{X})$ of N is a bipartition of the nodes in V such that $s \in X$ and
$t \in \bar{X}$. The net-cut $net(X, \bar{X})$ of the cut is the set of nets
in N that are incident to nodes in both X and $\bar{X}$. A cut $(X, \bar{X})$
is a min-net-cut if $|net(X, \bar{X})|$ is minimum among all s-t cuts of N. In
Figure 1(b), $net(X, \bar{X}) = \{b, e\}$, $net(Y, \bar{Y}) = \{c, a, b, e\}$,
and $(X, \bar{X})$ is a min-net-cut.

In order to find a min-net-cut in N = (V,E), we reduce it to the problem of
finding a cut of minimum capacity, and then solve the latter problem by the
max-flow min-cut theorem. If the cut edges all have unit capacity, then the
problem is equivalent to finding a cut with the minimum number of forward
edges from X to $\bar{X}$.

We construct a flow network N' = (V',E') from N = (V,E) as follows (see
Figures 2 & 3):

1. V' contains all nodes in V.

2. For each net $n = (v; v_1, \ldots, v_l)$ in N, add two nodes $n_1$ and
$n_2$ in V' and a bridging edge $(n_1, n_2)$ in E'.

3. For each node $u \in \{v, v_1, \ldots, v_l\}$ incident on net n, add two
edges $(u, n_1)$ and $(n_2, u)$ in E'.

4. Let s be the source of N' and t the sink of N'.

5. Assign unit capacity to all bridging edges and infinite capacity to all
other edges in E'.

6. For a node $v \in V'$ corresponding to a node in V, w(v) is the weight
of v in N. For a node $u \in V'$ split from a net, w(u) = 0.





Figure 2. Modeling a net in N in the flow network N'. (Panel label: the nodes and edges
corresponding to net n in N'.)

Note that all nodes incident on net n are connected to $n_1$ and are connected
from $n_2$ in N'. Hence the flow network construction is symmetric with
respect to all nodes incident on a net. We show in Lemma 2.1 that the size of
N' is only a constant factor larger than the size of N, for a connected graph
N.

Lemma 2.1 Let N' = (V',E') be the flow network constructed from a digraph
N = (V,E) using the above method. Then $|V'| \le 3|V|$ and
$|E'| \le 2|E| + 3|V|$.
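
The construction is mechanical enough to write down directly. The following Python sketch builds the edge capacities of N' from a small netlist and counts nodes and edges so that the bounds of Lemma 2.1 can be checked on an instance. The input layout (a dictionary from net name to a driver and its sinks) and the labels `(name, 1)` and `(name, 2)` for $n_1$ and $n_2$ are assumptions chosen for this example.

```python
INF = float("inf")


def build_flow_network(nets):
    """Build the edge capacities of N' from a netlist.

    nets maps a net name to (driver, sinks), i.e. the net n = (v; v_1, ..., v_l).
    The node labels (name, 1) and (name, 2) stand for n_1 and n_2
    (hypothetical naming chosen for this sketch).
    """
    capacity = {}
    for name, (driver, sinks) in nets.items():
        n1, n2 = (name, 1), (name, 2)
        capacity[(n1, n2)] = 1           # bridging edge, unit capacity
        for u in [driver, *sinks]:
            capacity[(u, n1)] = INF      # every incident node feeds n_1
            capacity[(n2, u)] = INF      # and is fed from n_2
    return capacity


# Two nets over three circuit nodes: a = (r1; g1, g2) and b = (g1; g2).
nets = {"a": ("r1", ["g1", "g2"]), "b": ("g1", ["g2"])}
capacity = build_flow_network(nets)
nodes_N = {u for driver, sinks in nets.values() for u in [driver, *sinks]}
nodes_N_prime = {u for edge in capacity for u in edge}
edges_N = sum(len(sinks) for _, sinks in nets.values())
```

Each net contributes two extra nodes and 2l + 3 edges for l sinks, and a node drives at most one net, which is where the bounds in the lemma come from.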

The above flow network construction for modeling net-cuts also works when
the circuit N is represented by a hypergraph. We note that another optimal
approach for finding a min-net-cut of a hypergraph was given in [16] by
modeling a net (or a hyperedge) as a star node, and then transforming a
node-capacitated flow network into an edge-capacitated network [22] by
splitting every node. Our method is different from that of [16] in that we
split the nodes corresponding to the nets only, and hence use fewer nodes and
edges in the resulting flow network. For a huge input circuit, our method
translates into less memory usage and faster runtime.

Figure 3. A flow network N' constructed from the circuit N in Figure 1(b). (Legend: a
bridging edge with unit capacity; an ordinary edge with infinite capacity.)

Another related result given in [18] shows that modeling hypergraphs by 
graphs (with positive weights) with the same min-cut properties is not possible. 
However, our method models a hypergraph for network flow based partition al- 
gorithms only. The differences between the model used in [18] and our model 
are the following: (1) The weight of a cut in [18] is computed by the sum of all 
cut edges with fixed weights, while the weight of a cut in our model is computed 
by the sum of the capacities of the forward edges only. (2) [18] tries to model 
hypergraphs for a wide range of existing partition algorithms developed for or- 
dinary graphs only, while our method just models a hypergraph for network flow 
based partition algorithms. Hence our method is able to exactly model a hyper- 
graph for our flow based partitioning algorithm without contradicting the result 
in [18]. 

It is easy to see that N' is a strongly connected digraph. The strong
connectivity of N' is the key to reducing the bi-directional min-net-cut
problem to the minimum capacity cut problem that counts the capacity of the
forward edges only. We show in Theorem 2.2 and Corollary 2.2.1 that the
problem of finding a min-net-cut in N can be reduced to the problem of finding
a cut with minimum capacity in N'.

Theorem 2.2 N has a cut of net-cut size at most C if and only if N' has a cut
of capacity at most C.

Corollary 2.2.1 Let $(X', \bar{X}')$ be a cut of minimum capacity C in N', and
let $(X, \bar{X})$ be the cut in N as constructed in Theorem 2.2 (2). Then
$(X, \bar{X})$ is a min-net-cut in N, and $|net(X, \bar{X})| = C$.

2.2 Optimal Network Flow Based Min-Net-Cut Bipartition

Based on Theorem 2.2 and Corollary 2.2.1, we give an optimal algorithm
for finding a bipartition (not necessarily balanced) of a circuit N = (V,E)
with respect to two nodes s and t that minimizes the number of crossing nets.

Algorithm 1: Finding A Min-Net-Cut

0. Construct the flow network N' = (V',E') for N as
described in Subsection 2.1;

1. Find a max-flow in N' from s to t;

2. Find a cut $(X', \bar{X}')$ of minimum capacity in N' as
described in the max-flow min-cut theorem;

3. Find a min-net-cut $(X, \bar{X})$ in N as described in
Theorem 2.2 (2).
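
The four steps can be put together in a compact Python sketch. It is an illustration under stated assumptions rather than the FBB implementation: nets are given as lists of incident nodes (the hypergraph view), augmenting paths are found by breadth-first search, s and t are assumed to be distinct, and step 3 simply takes X to be the circuit nodes of X', since the construction referred to in Theorem 2.2 (2) is not reproduced in this text.

```python
from collections import defaultdict, deque

INF = float("inf")


def min_net_cut(nets, s, t):
    """Return (X, cut_nets, flow_value) for nets = {name: [incident nodes]}."""
    # Step 0: residual capacities of N' (reverse arcs start at 0).
    residual = defaultdict(lambda: defaultdict(int))
    circuit_nodes = set()
    for name, members in nets.items():
        n1, n2 = ("net", name, 1), ("net", name, 2)
        residual[n1][n2] += 1      # bridging edge with unit capacity
        residual[n2][n1] += 0
        for u in members:
            circuit_nodes.add(u)
            residual[u][n1] = INF
            residual[n1][u] += 0
            residual[n2][u] = INF
            residual[u][n2] += 0

    def search():
        """Breadth-first search from s over arcs with residual capacity."""
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    # Step 1: augment until t is no longer reachable.
    value = 0
    while True:
        parent = search()
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push
        value += push

    # Step 2: X' is the set reached by the last search.
    # Step 3 (assumed mapping): keep only the circuit nodes of X'.
    X = {u for u in parent if u in circuit_nodes}
    cut_nets = {name for name, members in nets.items()
                if any(m in X for m in members)
                and any(m not in X for m in members)}
    return X, cut_nets, value


# Net a touches s, u, v; nets b and c lead from u and v to t.
X, cut_nets, value = min_net_cut(
    {"a": ["s", "u", "v"], "b": ["u", "t"], "c": ["v", "t"]}, "s", "t")
```

Every path between circuit nodes passes through a unit-capacity bridging edge, so each augmentation pushes a finite amount even though the ordinary edges have infinite capacity.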

Theorem 2.3 Algorithm 1 finds a min-net-cut in a circuit N = (V,E), and ter- 
minates in 0(|V||£|) time where A is a connected circuit. 

Proof: The correctness of Algorithm 1 has been established in Theorem 2.2 
and Corollary 2.2.1. 

Steps 0, 2 and 3 take linear time in the size of N. We use the simple augmenting 
path algorithm [8] to implement step 1. Finding an augmenting path in N' 
takes $O(|E'|)$ time. The number of augmenting paths in N' is equal to the number 
of bridging edges in a min-net-cut, which is at most $|V|$ (usually much less). 
Hence Algorithm 1 takes $O(|V||E'|) = O(|V|(|E| + |V|)) = O(|V||E|)$ time by 
Lemma 2.1. □ 
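
The steps of Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: Subsection 2.1 is not reproduced here, so the sketch assumes that each net n contributes two extra nodes joined by a unit-capacity bridging edge, and that every circuit node on n has an infinite-capacity edge into the first extra node and out of the second. The names `build_flow_network`, `search`, and `min_net_cut` are hypothetical.

```python
from collections import defaultdict, deque

INF = float("inf")

def build_flow_network(nets):
    """Residual capacities cap[u][v] for the ASSUMED net-to-flow construction."""
    cap = defaultdict(dict)

    def arc(u, v, c):
        cap[u][v] = c            # forward capacity
        cap[v].setdefault(u, 0)  # reverse (residual) arc starts empty

    for net, pins in nets.items():
        n1, n2 = ("in", net), ("out", net)
        arc(n1, n2, 1)           # bridging edge with unit capacity
        for u in pins:
            arc(u, n1, INF)      # circuit node -> net
            arc(n2, u, INF)      # net -> circuit node
    return cap

def search(cap, s):
    """BFS along arcs with positive residual capacity; returns a parent map."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v, c in cap[u].items():
            if c > 0 and v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def min_net_cut(nets, s, t):
    """Return (X, cut): the nodes on the s side and the names of the cut nets."""
    cap = build_flow_network(nets)
    while True:
        parent = search(cap, s)
        if t not in parent:      # no augmenting path left: the flow is maximum
            break
        v = t
        while parent[v] is not None:   # push one unit of flow along the path
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
    nodes = {u for pins in nets.values() for u in pins}
    X = set(parent) & nodes
    cut = sorted(n for n in nets
                 if ("in", n) in parent and ("out", n) not in parent)
    return X, cut
```

Each augmentation pushes exactly one unit, which is valid because every path between two circuit nodes crosses a unit-capacity bridging edge or a residual arc carrying integral flow.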

There are other asymptotically faster (worst-case) algorithms for finding a 
max-flow than the simple augmenting path algorithm based on the Ford and 
Fulkerson method. The fastest preflow method takes $O(|E||V| \log(|V|^2/|E|))$ 
time [12] with a large constant factor. The Ford and Fulkerson method takes 
$O(|E| \cdot \text{max-flow-value})$ time. The latter method is efficient in our application, 
since the max-flow-value in the special flow network we construct is at most $|V|$ 
(usually much smaller than $|V|$). 

2.3 Most r-balanced Min-Net-Cut Bipartition is 
NP-complete 

The min-cut bipartition may yield unbalanced components. The max-flow 
computation defines a set of min-cuts with the same cut size, but with varying 
weights in the two partitions. It is natural to ask the question of whether one 
can find a min-cut that is the most r-balanced among all the min-cuts defined 
by a max-flow, i.e., among all possible min-cuts $(X, \bar{X})$ defined by a max-flow, 
find the min-cut such that $|w(X) - r\,w(V)|$ is as close to 0 as possible. One can 
show that the decision version of the problem is NP-complete by reducing the 
weighted subset sum problem [11] to it. 

3. Min-Cut Balanced Bipartition 

It is not difficult to see that repeatedly applying the max-flow min-cut tech- 
nique to cut the larger of the two partitions will eventually produce a balanced bi- 
partition with a natural small net-cut. However, this approach was overlooked as 
a viable heuristic approach to circuit partitioning due to its high complexity (pos- 
sibly |V| max-flow computations). In this section we first describe a repeated 
max-flow min-cut heuristic algorithm, Flow-Balanced-Bipartition (FBB), for 
finding an r-balanced bipartition that minimizes the number of crossing nets. 
We then give an efficient implementation of FBB that has the same asymptotic 
time complexity as one max-flow computation. For ease of presentation, we will 
describe our algorithm in terms of the original circuit and net-cuts, instead of the 
flow network constructed from the circuit (as shown in Section 2.1) and forward 
bridging edges, when there is no confusion. An illustration of Algorithm 2 is 
shown in Figure 4. 

3.1 Balanced Bipartition Heuristic 

Given a circuit $N = (V, E)$, FBB randomly picks a pair of nodes s and t in 
N, and then tries to find an r-balanced bipartition that separates s and t and 
that minimizes the number of crossing nets. Let W be the total weight of the 
circuit N. Since in practice there is little reason to strictly enforce the r-balanced 
criterion, we allow the component weights to deviate from $(1 - \varepsilon) rW$ to 
$(1 + \varepsilon) rW$. Given a subcircuit X of N, let w(X) denote the total weight of nodes in 
X. 

Algorithm 2: Flow-Balanced-Bipartition (FBB) 

0. Randomly pick a pair of nodes s and t in N; 

1. Find a min-net-cut C in N; 
   Let X be the subcircuit reachable from s through 
   augmenting paths in the flow network, and $\bar{X}$ the rest; 

2. If $(1 - \varepsilon) rW \le w(X) \le (1 + \varepsilon) rW$ 
   then stop and return C as the answer. 

3. If $w(X) < (1 - \varepsilon) rW$ 
   then 3.1. collapse all nodes in X to s; 
        3.2. collapse to s a node $v \in \bar{X}$ adjacent to C; 
        3.3. goto 1; 

4. If $w(X) > (1 + \varepsilon) rW$ 
   then 4.1. collapse all nodes in $\bar{X}$ to t; 
        4.2. collapse to t a node $v \in X$ adjacent to C; 
        4.3. goto 1. 
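
The three-way test in steps 2 to 4 can be written out as a small Python helper. It is only a sketch of the control decision, not of the flow computation; the function name `fbb_decision` and the returned labels are hypothetical.

```python
def fbb_decision(w_X, W, r, eps):
    """Decide what Algorithm 2 does next, given the weight w(X) of the s side.

    W is the total circuit weight, r the target ratio, eps the allowed deviation.
    """
    low = (1 - eps) * r * W
    high = (1 + eps) * r * W
    if w_X < low:
        return "collapse X and one more node into s"   # step 3
    if w_X > high:
        return "collapse the rest and one more node into t"  # step 4
    return "stop"                                      # step 2
```

For instance, with W = 12, r = 1/2 and eps = 0.1 the accepted window is roughly 5.4 to 6.6, so a component of weight 6 stops the algorithm while weights 3 and 9 trigger steps 3 and 4 respectively.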
Figure 4. An illustration of Algorithm 2 for $r = 1/2$, $\varepsilon = 0.15$ and unit weight for each node. 
Hence a component can have weight between 5 and 7. If $\varepsilon = 0.45$, then a component can have 
weight between 3 and 10, and Algorithm 2 would terminate after finding cut $(X_2, \bar{X}_2)$. A small 
solid node indicates that the bridging edge corresponding to the net is saturated with flow. 



Step 1 can be implemented by Algorithm 1. In step 3.2, we need to collapse a 
node $v \in \bar{X}$ incident on a cut net to s, since otherwise the same set of nets in C will 
again be chosen as the min-net-cut in the next iteration in step 1. The reasons 
why we adopt this node collapsing method instead of a more gradual method 
(i.e., increasing the unit capacities of the bridging edges by a fixed amount as 
in [17]) are (1) the capacity of the cut would no longer reflect the real net-cut 
size, and (2) the runtime would not be bounded by one flow computation. By 
collapsing v to s, FBB is able to explore a different net-cut with a larger X in the 
next iteration. Note that the size of the min-net-cut found in the next iteration 
will be the same as or larger than the size of the min-net-cut in the current 
iteration. A similar argument holds for step 4.2. 

We now describe our strategy for picking a node in steps 3.2 and 4.2. To 
find an r-balanced bipartition that minimizes the net-cut size, our heuristic is to 
always focus on finding a min-net-cut at each iteration. But when the remaining 
circuit is very large, the current min-net-cut has less influence on what the final 
balanced min-net-cut would be. Therefore, we randomly pick a node in steps 
3.2 and 4.2 in order to speed up the algorithm. When the remaining circuit 
becomes small enough, we need to be more careful about which node we pick, 
and we can afford to try out more than one node. We give a threshold value R for 
the number of nodes in the un-collapsed subcircuit. If the number of remaining 
nodes is larger than R, then we randomly pick one node from the nodes incident 
on the cut nets in C. Otherwise, we try all nodes incident on the cut nets in C 
and pick the node whose collapsing induces a min-net-cut with the smallest size. 
We can also let the probability of choosing a node be inversely proportional to 
the number of nodes in the remaining (un-collapsed) circuit. 

3.2 Efficient Implementation of Algorithm 2 

A drawback of the repeated max-flow heuristic is that it has a relatively high 
time complexity. Iteratively applying Algorithm 1 in step 1 of Algorithm 2 to 
compute a max-flow and a min-net-cut from the zero flow can be very time- 
consuming. We show an efficient way to deal with this problem. In fact, it 
is not necessary to do the max-flow computation from the zero flow in every 
iteration. Instead, we can retain the flow value in the flow network, and only find 
additional flow to saturate the bridging edges of the min-net-cut from iteration 
to iteration. 

In Procedure 1, we describe the incremental max-flow computation in step 1 
of Algorithm 2. Initially, the flow network retains the flow function computed 
in the previous iteration. 

Procedure 1: Incremental Flow Computation 

0. While $\exists$ an additional augmenting path from s to t 

1.     increase flow value along the augmenting path; 

   /* There is no more augmenting path from s to t. */ 

2. Mark all nodes u such that 
       $\exists$ an augmenting path from s to u; 

3. Let C' be the set of bridging edges whose starting 
   nodes are marked and ending nodes are not marked; 

4. Return the nets corresponding to the bridging edges in 
   C' as the min-net-cut C, and the marked nodes as X. 

Since the max-flow computation using the augmenting path method is insen- 
sitive to the initial flow values in the flow network and the order in which the 
augmenting paths are found, the above procedure correctly finds a max-flow 
with the same flow value as a max-flow computed in the collapsed flow network 
from scratch (i.e., the zero flow). 
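
The claim above can be checked on a toy network with a short Python sketch. It is an illustration only: the network is a generic digraph rather than the circuit flow network, capacities are assumed to be integers, and collapsing a node v into s is modelled (as an assumption of this sketch) by adding an infinite-capacity arc from s to v. The names `add_arc`, `saturate`, and `compare` are hypothetical.

```python
import copy
from collections import deque

INF = float("inf")

def add_arc(cap, u, v, c):
    """Add capacity c to arc u->v in the residual network cap."""
    cap.setdefault(u, {})
    cap.setdefault(v, {})
    cap[u][v] = cap[u].get(v, 0) + c
    cap[v].setdefault(u, 0)

def saturate(cap, s, t):
    """Augment from the CURRENT residual state until no s-t path remains.

    Returns (additional flow pushed, set of nodes marked reachable from s).
    """
    extra = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return extra, set(parent)
        v = t
        while parent[v] is not None:   # push one unit along the path found
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        extra += 1

def compare():
    net = {}
    for u, v, c in [("s", "a", 1), ("s", "b", 1), ("a", "t", 2), ("b", "t", 1)]:
        add_arc(net, u, v, c)
    scratch = copy.deepcopy(net)

    flow, _ = saturate(net, "s", "t")         # first iteration
    add_arc(net, "s", "a", INF)               # collapse a into s, keep the flow
    extra, marked = saturate(net, "s", "t")   # incremental step

    add_arc(scratch, "s", "a", INF)           # same collapse, but zero flow
    total, marked_scratch = saturate(scratch, "s", "t")
    return flow + extra, marked, total, marked_scratch
```

In this example the retained flow plus the additional augmentation reaches the same value, and marks the same nodes, as the computation restarted from the zero flow.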

We show in Theorem 3.1 that if we fix the threshold R used in the node-picking 
strategy described in the previous subsection as a constant, then the 
total complexity of FBB is $O(|V||E|)$, which is the same as the complexity of 
one max-flow computation. 

Theorem 3.1 If Procedure 1 is used to implement step 1 of Algorithm 2, then 
Algorithm 2 has time complexity $O(|V||E|)$ for a connected circuit $N = (V, E)$. 

Proof: Since each augmenting path computation takes $O(|E|)$ time, we prove 
that the total time complexity of step 1 of Algorithm 2 is $O(|V||E|)$ by showing 
that there are at most $2|V|$ augmenting path computations over all 
iterations. 
The total flow value $|f|$ in the flow network N' constructed from N at the end 
of Algorithm 2 is the number of forward bridging edges in the final min-net-cut. 
Hence $|f|$ is at most $|V|$. Since bridging edges have unit capacity, there are 
$|f| \le |V|$ s-t augmenting paths found at the end of Algorithm 2. We now consider 
an augmenting path computation in step 0 of Procedure 1. Either an augmenting 
path is found, in which case the number of augmenting paths increases by 1, 
or at least one node will be collapsed to s or t in steps 3.1 and 3.2 or 4.1 and 
4.2 of Algorithm 2. Hence the number of augmenting path computations over all 
iterations is at most $2|V|$. 

Note that step 3 of Procedure 1 can be accomplished during the search for 
an augmenting path in step 2 of Procedure 1, and steps 4 and 5 of Procedure 1 
take $O(|V|)$ time in the worst case. □ 

In practice, as shown in the experimental results, Algorithm 2 terminates 
much faster than the $O(|V||E|)$ worst case time complexity. Because of the 
construction of the flow network in Section 2.1 where the bridging edges have 
unit capacity, the number of augmenting paths found in Algorithm 2 is the same 
as the size of the net-cut found in N, which is much less than $|V|$. 

Theorem 3.2 The number of iterations and the final net-cut size of Algorithm 2 
are non-increasing functions of $\varepsilon$. 

Proof: Fewer iterations are needed in Algorithm 2 when $\varepsilon$ is larger, since the 
condition in step 2 of Algorithm 2 is satisfied in fewer iterations. 

If an augmenting path from s to t is found in step 1 of Procedure 1, then the 
flow value is increased by at least 1 and hence the size of the min-net-cut is 
increased by at least 1. If an augmenting path from s to t is not found in step 1 
of Procedure 1, then the size of the min-net-cut is equal to the flow value of the 
previous iteration, which is equal to the previous min-net-cut size. Hence the 
net-cut size found in each iteration is non-decreasing. □ 

Theorem 3.2 guarantees that with a larger $\varepsilon$ deviation factor we can improve 
the efficiency of Algorithm 2 and obtain a better partitioning solution. This prop- 
erty is not true for other partitioning approaches such as the K&L heuristics. An- 
other interesting corollary of Theorem 3.2 is that the longer the execution time 
of Algorithm 2, the worse the net-cut size in the final solution. This property of 
Algorithm 2 can be utilized to further improve the efficiency of Algorithm 2. 

4. Experimental Results 

We implemented Algorithm 2 in a package called FBB using the C language, 
and integrated FBB in SIS/MISII [1]. Currently FBB runs on circuit formats 
accepted by SIS/MISII. We tested FBB on a set of large ISCAS and MCNC 
benchmark circuits using a SPARC 10 workstation with a 36 MHz SSIO and 
32MB memory (the C code was compiled with gcc without the optimizer). For 

Table 1. Comparison of bipartition results of SN, PFM3, and FBB (with $r = 1/2$ and $\varepsilon = 0.1$). 



each circuit tested, the number of gates and latches, the number of nets, and 
the average net degree (i.e., the average number of nodes connected to a net) are 
given in Tables 1 and 2. Note that the actual number of nodes in a circuit includes 
PI nodes, and is therefore more than the number of gates and latches. 

Table 1 compares the average bipartition results of FBB with the results reported 
by Dasdan and Aykanat in [5]. The program SN is based on the K&L 
heuristic algorithm in Sanchis [24], which is a generalization of the 
Krishnamurthy [21] algorithm. The program PFM3 is based on a K&L heuristic with 
free moves as described in [5]. SN was run 20 times and PFM3 was run 10 
times on each circuit starting from different randomly generated initial 
partitions, while FBB was run 10 times on each circuit from different randomly 
generated s and t as the source and the sink respectively. Table 1 shows that with 
only one exception, FBB outperforms both SN and PFM3 on the 5 circuits. On 
average, FBB finds a bipartition with 24.5% and 19.0% fewer crossing nets than 
SN and PFM3 respectively. This is not too surprising since max-flow min-cut 
techniques tend to find a natural small cut. The average actual ratios of the two 
partitions obtained by FBB are also shown in Table 1. Since we set $\varepsilon = 0.1$, the 
actual ratios of the two partitions are roughly the same (1:1.10 on average). 

We did not compare the runtime of SN, PFM3, and FBB since they were 
run on different workstations. SN and PFM3 were run on a SUN SPARC ELC, 
and FBB was run on a SUN SPARC 10. For reference, for C3540, the average 
elapsed time (not CPU time) in seconds of SN, PFM3, and FBB for each 
run is 90.3, 71.0, and 13.6 respectively; and for C7552, the average elapsed 
time in seconds of SN, PFM3, and FBB for each run is 44.3, 81.8, and 18.8 
respectively. 

Table 2 compares the best bipartition net-cut size of EIG1 (Hagen and Kahng 
[15]), PARABOLI (Riess, Doll, and Frank [23]), and FBB. EIG1 and PARABOLI 
are two programs based on analytical methods, and their results were obtained 
from [23]. The results produced by PARABOLI were the best previously known 
results reported on the benchmark circuits. The results for FBB were the best of 
10 runs. The elapsed time of FBB for the run that generates the best result was 
also recorded. All results in Table 2 allow up to 10% deviation from bisection. 
On average, FBB outperforms EIG1 and PARABOLI by 58.1% and 11.3% 
respectively. For circuit S38417, FBB produces a larger net-cut than PARABOLI 
does. We consider the following possible explanations: 1) If FBB is run more 
than 10 times, the best net-cut result is likely to be better. 2) In a huge circuit 
like S38417, the solution is sensitive to the selection of the initial s,t pair of 
nodes. Applying circuit clustering techniques based on the connectivity 
information before partitioning may improve the partitioning result of FBB. 

Table 2. Comparison of bipartition results of EIG1, PARABOLI (PB) and FBB (with $r = 1/2$ 
and $\varepsilon = 0.1$). All results allow up to 10% deviation from bisection. 

Note that different programs using the same MCNC benchmark circuits re- 
ported different properties such as the number of cells and the number of nets 
for these circuits. This is because when a netlist format is translated to a hy- 
pergraph, some unnecessary details such as inverters are omitted. However, the 
underlying netlist structures are the same. 

In the experiment, we have consistently observed that for the runs with longer- 
than-average runtime, FBB always generates exceptionally poor solutions. This 
can be explained by Theorem 3.2, since the net-cut size is non-decreasing with 
more iterations. This property of FBB is in contrast to both the K&L heuristics 
and the simulated annealing heuristics, where longer runtime means better so- 
lutions. This property of FBB provides another way of improving the efficiency 
of FBB. We can pick a reasonable upperbound for the runtime of FBB (for ex- 
ample, based on a few runs of FBB), stop FBB when the runtime exceeds the 
upperbound, and restart FBB using a new pair of nodes s and t. By doing so 
we are not likely to lose any good solutions, but we will further improve the 
efficiency of FBB. 

5. Conclusions and Discussions 

We have described a method for exactly modeling a netlist by a flow net- 
work, presented a balanced bipartition heuristic based on the repeated max-flow 
min-cut techniques, and given an efficient implementation of a good theoretical 
method. We implemented our algorithm in a package called FBB. The 
experimental results demonstrate that the repeated max-flow min-cut heuristic 
outperforms K&L heuristics and analytical methods in terms of the number of 
crossing nets, and the efficient implementation enables our heuristic algorithm 
to partition large benchmark circuits with reasonable runtime. 

FBB has predictable behavior in terms of the sizes of the two partitions, and 
it exhibits a direct relationship between efficiency, solution quality, and the 
relaxation of the r-balanced criterion through a larger $\varepsilon$. Such a direct relationship was not 
shown in previous heuristics for circuit partitioning. We also believe that the choice 
of the pair of nodes s and t as the initial configuration of FBB has less influence 
on the solution than an initial bipartition would have. Hence the solution quality 
of FBB is less sensitive to the initial choice of s and t. 

Our algorithm can be easily extended to handle the case where the nets in 
a circuit have different weights. We can simply assign the weight of a net to 
its corresponding bridging edge in the flow network, and FBB will find a net-cut 
with its weight minimized. K-way min-cut partitioning for $K > 2$ can be 
accomplished by recursively applying FBB, or by setting $r = 1/K$ and then using 
FBB to find one partition at a time. We are currently investigating more natural 
methods based on flow networks for the K-way min-cut partitioning problem. 

Pre-partitioning circuit clustering according to the connectivity or the timing 
information of the circuit can be easily incorporated into FBB by treating a 
cluster as a node. A possible extension to FBB would be to combine FBB with 
the K&L heuristics and the simulated annealing heuristics. We can use FBB 
to find a natural small net-cut as a good initial partition, and then apply the 
K&L heuristics or the simulated annealing heuristics with low temperature to 
fine-tune the solution. 
Abstract 

The first and the most critical stage in VLSI layout design is the placement, the background of 
which is the rectangle packing problem : Given many rectangular modules of arbitrary size, place 
them without overlapping on a layer in the smallest bounding rectangle. Since the variety of the 
packing is infinitely many (two-dimensionally continuous), the key issue for successful optimiza- 
tion is in the introduction of a P-admissible solution space, which is a finite set of solutions at 
least one of which is optimal. This paper proposes such a solution space where each packing is 
represented by a pair of module name sequences. Searching this space by simulated annealing, 
hundreds of modules could be successfully packed as demonstrated. Combining a conventional 
wiring method, the biggest MCNC benchmark ami49 is challenged. 



1. Introduction 

Layout in the physical design of VLSI is, simply put, to pack all the circuit 
elements in a chip without violating the design rules, so that the circuit 
performs well and the production yield is high. Although the targets vary across 
the different stages, the problem defined as follows underlies all of them. 

Rectangle Packing Problem: RP 

Let M be a set of m rectangular modules whose height and width are given 
by real numbers. (Orientation is fixed.) A packing of M is a non-overlapping 
placement of the modules. The minimal bounding rectangle of a packing is 
called the chip. Find a packing of M in a chip of the minimum area. 

A packing of six modules is shown in Fig.1. 



Authors are currently with 'MicroArk Co., Ltd., ^Tokyo University of Agriculture and Technology, and 
RP can be shown to be NP-hard by a reduction from an NP-hard problem, namely 
RP with the constraint that the width of the chip is fixed [1]. 

Since the height and width of modules are continuous real numbers, RP is not 
simply a combinatorial optimization problem. Hence there have been several 
numerical approaches [2, 3]. 

An alternative approach is “combinatorial search”. Define a solution space 
which is a system of codes and a mapping from each code to a solution which 
is a packing in the case of RP. Each code represents a constraint imposed on 
packing. A code is said to be feasible if the constraint is consistent, i.e. there 
exists a packing that satisfies the constraint represented by the code. The mapping 
defines a consistent packing for a code if it is feasible. The evaluation of a code 
is the minimum area of the chip of the corresponding packings. The combinatorial 
search is to search for a better code in the solution space. If computation time 
must be traded off, the heuristic stops the search partway and 
outputs the best code found so far. For this search to be effective, the minimum 
requirements on the solution space are: 

(1) The size of the code set is finite, 

(2) Every code is feasible, 

(3) The mapping, as well as the evaluation in our case, for each code is possible 
    in polynomial time, and 

(4) The best evaluated code in the solution space coincides with an optimal 
    solution of the original problem, that is RP in our case. 

The solution space that satisfies the above four requirements is called P- 
admissible. 

The reasons for (1),(3) and (4) are obvious. That for (2) is: most heuristics 
pick up one solution after another along the neighboring structure defined on the 
space [4], consulting with the difference of evaluations (gain) to the previous 
solution. Therefore, if infeasible solutions are included, the continuity will be 
destroyed and the convergence to a feasible solution is not guaranteed. 

A practically known solution space is the one derived from the slicing floorplan 
proposed by Otten [5] and others. Since it satisfies (1), (2) and (3), several 
optimization heuristics have been applied to the space, and one of the most successful 
approaches uses simulated annealing [6]. Since the optimal solution can be 
non-slicing, it lacks (4). Efforts have been made to add non-slicing structures [7, 8]. 

On the other hand, Onodera et al. [9] use a solution space obtained by assigning one 
out of four relations, “left-of”, “right-of”, “above”, “below”, to every pair of 
modules. This space satisfies (4) since any packing satisfies a combination of 
the relations. But there are many infeasible codes, such as: module a is left-of 
module b, b is left-of module c, and c is left-of a. As a consequence, the 
space does not admit heuristics such as simulated annealing. In their paper, 
an exhaustive search with a branch-and-bound technique is applied to find an 
exactly optimal solution, but the size of tractable problems is limited to six 
modules. Thus these two are not P-admissible. 

This paper provides a P-admissible solution space, in which each code is 
an ordered pair of module name sequences, which we call the Sequence-Pair. 
Searching this space, we have been able to pack hundreds of modules very effi- 
ciently, almost optimally at a look as in Fig.9 and in Fig. 10. 

To utilize this solution space of RP for VLSI layout design, the evaluation of 
a packing has to be modified to consider wires. Many formulae have been 
proposed for this purpose [6, 9], and we follow [9]. The largest MCNC 
building-block benchmark is successfully placed by simulated annealing in about 30 
minutes (Fig.11). 

This paper is organized as follows. In Section 2, a mapping from a given 
packing to a pair of module name sequences is given, to show that an optimal 
solution is included. Section 3 provides a procedure of an inverse mapping from 
a sequence pair to a packing. Section 4 demonstrates how the space can be 
utilized in placement problems. Section 5 is for concluding remarks. 

2. From A Packing to A Sequence-Pair 

Let $\Pi$ be a packing on chip C. See Fig.1 for an example. 



Figure 1. A packing in a chip whose area is $H \times W$. 
2.1 Gridding 

A floorplan is a partition of C into rectangles, called rooms, such that a room 
contains at most one module. A room which contains no module is said to be 
empty. 

The line segments forming the room boundaries (including four sides of 
C) are called the cutting-segs. We assume that a cutting-seg, except for four 
sides of C, terminates at a midpoint of an orthogonal cutting-seg (forming a T- 
intersection). It is trivial that such a floorplan is always possible, although not 
necessarily unique. 

In the following, we describe a procedure to get a pair of module name se- 
quences from a packing. 

procedure: Gridding($\Pi$) 

Obtain one arbitrary floorplan and fix it. (See Fig.2, which is an example 
floorplan corresponding to $\Pi$ in Fig.1.) Take a non-empty room. Put a pebble p at 
the center of the room. Move it rightward until it hits the cutting-seg which is the 
right side of the room. Then, move p upward until it hits an orthogonal 
cutting-seg. Then, move it rightward until it hits an orthogonal cutting-seg, and continue 
turning its direction as right, up, right, up, ..., until reaching the upper right 
corner of the chip. The locus of pebble p is called the right-up locus of the module. 
Similarly, up-left locus, left-down locus, and down-right locus are defined. (Fig.3 
shows these four loci of one module.) 

The union of right-up locus of module x and left-down locus of module x is 
called the positive locus (since it passes inside the 1st and 3rd quadrants). Analo- 
gously, the union of the up-left locus of x and down-right locus of x is called the 
negative locus. For every module, one positive locus and one negative locus are 
uniquely defined. They are referred to by the corresponding module names. (An 
example with all loci is shown in Fig.4.) 




Figure 2. A floorplan of a packing. 



Figure 3. Loci of module b. 
Figure 4. Positive loci (left) and negative loci (right), resulting in $(\Gamma_+, \Gamma_-)$ = (abdecf, cbfade). 



Theorem 1 No pair of positive loci cross each other. No pair of negative loci 
cross each other. (They may run along the same cutting-segs, but not cross each 
other.) 

Proof: Let two modules be a and b. Since the positive locus of either module cannot 
be inside the other's room, a crossing, if any, would occur outside their rooms. 
Denote the right-up locus of module a by RU(a). Similar notation is applied for 
the other three types of loci. 

Case 1: Suppose that RU(b) comes from below and hits RU(a) at a point $p_1$. 
See Fig.5 (case 1). Since RU(a) and RU(b) are along cutting-segs, RU(b) cannot 
cross RU(a) at $p_1$ by definition of the cutting-seg. After $p_1$, the two may run 
along together for a while. Since they are following the same rule of right-up 
locus, they run together and never cross each other. Hence, right-up loci of a 
and b do not cross. By the same reason, left-down loci of a and b do not cross. 







Figure 5. Loci used in the proof of Theorem 1. 
Case 2: Suppose that RU(b) comes from below and hits LD(a) at a point 
$p_2$. See Fig.5 (case 2). After $p_2$, RU(b) goes right upstream along LD(a) for a 
while. Then RU(b) reaches the point where LD(a) comes from above. After 
that point, RU(b) continues to go right and thus goes below LD(a) again. Since 
RU(b) cannot go inside the room of a, it goes below the room of a. Hence, 
left-down locus of a and right-up locus of b do not cross. By the same reason, 
right-up locus of a and left-down locus of b do not cross. 

Then, the positive loci of a and b do not cross. Similarly, negative loci of a 
and b do not cross. 

The implication of the theorem is significant: all the m positive loci are 
linearly ordered, and so are the negative loci. Here we order the positive loci from 
the upper left, and order the negative loci from the lower left. Since each locus 
is uniquely referred to by the module name, we have obtained an ordered pair 
of module name sequences $(\Gamma_+, \Gamma_-)$, which we call the Sequence-Pair, where 
$\Gamma_+$ (resp. $\Gamma_-$) is a module name sequence which represents the order of positive 
(resp. negative) loci. 

In Fig.4, positive loci are in order “abdecf” and negative loci are in order 
“cbfade”, so $(\Gamma_+, \Gamma_-)$ = (abdecf, cbfade) is obtained. 

Given packing $\Pi$, the corresponding Sequence-Pair is not unique due to the 
arbitrariness in fixing the floorplan. Let the one obtained by the procedure 
be $(\Gamma_+, \Gamma_-)$, which we denote as Gridding($\Pi$). 

2.2 Geometrical Information of Sequence-Pair 

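Let Gridding($\Pi$) = $(\Gamma_+, \Gamma_-)$. For a module x, any other module x' is uniquely 
in one of four cases: x' is after/before x in $\Gamma_+$/$\Gamma_-$. Let us define four classes 
accordingly. 

$M^{aa}(x) = \{\, x' \mid x' \text{ is after } x \text{ in both } \Gamma_+ \text{ and } \Gamma_- \,\}$, 

$M^{bb}(x) = \{\, x' \mid x' \text{ is before } x \text{ in both } \Gamma_+ \text{ and } \Gamma_- \,\}$, 

$M^{ba}(x) = \{\, x' \mid x' \text{ is before } x \text{ in } \Gamma_+ \text{ and after } x \text{ in } \Gamma_- \,\}$, 

$M^{ab}(x) = \{\, x' \mid x' \text{ is after } x \text{ in } \Gamma_+ \text{ and before } x \text{ in } \Gamma_- \,\}$. 

For example, with respect to $(\Gamma_+, \Gamma_-)$ = (abdecf, cbfade), the four subsets for 
module b are: $M^{aa}(b) = \{d, e, f\}$, $M^{bb}(b) = \emptyset$, $M^{ba}(b) = \{a\}$, and $M^{ab}(b) = 
\{c\}$. Any module other than x belongs to a unique subset, and it is trivial that 
two modules are in a dual relation through $aa \leftrightarrow bb$ and $ab \leftrightarrow ba$, as: 

$x' \in M^{aa}(x) \Leftrightarrow x \in M^{bb}(x')$ 
$x' \in M^{ba}(x) \Leftrightarrow x \in M^{ab}(x')$ 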
Let us formally define the terminology “left-of”, “right-of”, “above” and 
“below”. In a packing, if the left side of module x is to the right of the right side of module 
x', x is said to be right-of x'. Similarly, left-of, above, below relations between 
two modules are defined. These notations follow [9]. 

Theorem 2 Let Gridding($\Pi$) = $(\Gamma_+, \Gamma_-)$. If $x' \in M^{aa}(x)$, then x' is right-of x 
in $\Pi$. 

The claim holds replacing the pair of words (“$M^{aa}$” and “right-of”) with 
any of (“$M^{bb}$” and “left-of”), (“$M^{ba}$” and “above”), and (“$M^{ab}$” and 
“below”). 

An example is shown. In Fig.3, modules d, e, f are in $M^{aa}(b)$, and they are 
right-of b in $\Pi$. 

Proof: We sketch the proof taking an example of Fig.3. Pick arbitrary two 
modules, b and f. The loci of b divide the chip into four regions. Among them, 
the region surrounded by the right-up locus of b, down-right locus of b, and the 
right side of the chip is called the right-cone of b. Similarly, the left-, above-, 
and below-cone denote other three regions. 

Suppose f is in $M^{aa}(b)$. This implies that the positive locus of f is in the 
union of the right-cone and below-cone of b. Also it is implied that the negative 
locus of f is in the union of the right-cone and above-cone of b. The cross 
point of the positive and negative locus of f is in their intersection, that is, in 
the right-cone of b. Then, in the floorplan generated in Gridding($\Pi$), the center 
point of the room of f is in the right-cone of b. Hence the module f is also in the 
right-cone of b. All the modules in the right-cone of b must be right-of module 
b by definition of right-up locus and down-right locus of b. 

Similarly, the claim holds for the other cases. 

3. From A Sequence-Pair to A Packing 

In the previous section, we analyzed the packing, and fixed the way “grid- 
ding” to get one sequence-pair from a given packing. Now we synthesize one 
packing from an arbitrary sequence-pair. 

3.1 $(\Gamma_+, \Gamma_-)$-Packing 

Let $(\Gamma_+, \Gamma_-)$ be an arbitrary sequence-pair. We define a geometrical 
constraint derived from $(\Gamma_+, \Gamma_-)$. 

GeomConst$(\Gamma_+, \Gamma_-)$ 

For every two modules x and x', x' must be right-of x in $\Pi$ if $x' \in M^{aa}(x)$. This is 
also the constraint with replacing the pair of words (“$M^{aa}$” and “right-of”) with 
any of (“$M^{bb}$” and “left-of”), (“$M^{ba}$” and “above”), and (“$M^{ab}$” and “below”). 

A packing $\Pi$ is called a $(\Gamma_+, \Gamma_-)$-packing if $\Pi$ satisfies GeomConst$(\Gamma_+, \Gamma_-)$. 
Corollary 1: There is a $(\Gamma_+, \Gamma_-)$-packing if $(\Gamma_+, \Gamma_-)$ is the sequence-pair 
obtained by Gridding. □ 
However, we can prove the following fact. 

Theorem 3 For every sequence-pair $(\Gamma_+, \Gamma_-)$, there is a $(\Gamma_+, \Gamma_-)$-packing. 

Proof: Consider an m×m grid, where m is the number of modules. Label the 
horizontal grid lines and vertical grid lines with module names along Γ+ and 
Γ− from top and from left in order, respectively. The cross point of the horizontal 
grid line of label x and the vertical grid line of label x' is referred to by (x,x'). 
Then, draw the resultant grid on a plane rotating 45 degrees. (See Fig. 6.) Put 
each module x with its center on (x,x). Expand the separation of grid lines 
enough to eliminate overlapping of modules. (Actually, the expansion is enough 
if it is √2 times larger than the longest width/height over modules.) Trivially, the 
resultant packing satisfies the requirement implied by the given sequence-pair. 

An example is shown in Fig.6 which corresponds to (Γ+,Γ−) = (abdecf, 
cbfade). 
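
The construction in the proof can be sketched in a few lines of Python. This is an illustration only: the module dimensions in `sizes` are made up (the paper does not give them here), and `grid_packing` and `overlaps` are hypothetical helper names. The grid point of module x is taken at column "position in Γ−" and row "position in Γ+", and the grid is rotated by 45 degrees so that "after in both sequences" points to the right.

```python
import math

def grid_packing(gamma_pos, gamma_neg, sizes):
    """Place each module's center on the rotated grid point (x, x).

    sizes maps module name -> (width, height). Returns name -> (left, bottom).
    """
    longest = max(max(w, h) for w, h in sizes.values())
    pitch = math.sqrt(2) * longest          # grid-line separation used in the proof
    placement = {}
    for name, (w, h) in sizes.items():
        row = gamma_pos.index(name)         # horizontal line, counted from the top
        col = gamma_neg.index(name)         # vertical line, counted from the left
        # Rotate the grid point (col, -row) by 45 degrees counter-clockwise.
        cx = pitch * (col + row) / math.sqrt(2)
        cy = pitch * (col - row) / math.sqrt(2)
        placement[name] = (cx - w / 2, cy - h / 2)
    return placement

def overlaps(a, b, placement, sizes):
    """True if the interiors of modules a and b intersect."""
    (ax, ay), (aw, ah) = placement[a], sizes[a]
    (bx, by), (bw, bh) = placement[b], sizes[b]
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

# Assumed module dimensions, chosen only for illustration.
sizes = {"a": (4, 2), "b": (3, 3), "c": (2, 5), "d": (1, 1), "e": (2, 2), "f": (5, 1)}
placement = grid_packing("abdecf", "cbfade", sizes)
names = sorted(sizes)
print(any(overlaps(p, q, placement, sizes)
          for i, p in enumerate(names) for q in names[i + 1:]))  # False
```

The resulting packing is far from compact; it only shows that a packing satisfying the constraint always exists.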




Figure 6. A (Γ+,Γ−)-packing for (Γ+,Γ−) = (abdecf, cbfade). 

3.2 (Γ+,Γ−)-Optimal Packing 

Given (Γ+,Γ−), an optimal packing among the (Γ+,Γ−)-packings is said to be 
(Γ+,Γ−)-optimal. A (Γ+,Γ−)-optimal packing can be obtained in O(m^2) time by 
applying the well-known longest path algorithm for vertex-weighted directed acyclic 
graphs. The graphs can be made as follows. 

Based on the “right-of” constraint of (Γ+,Γ−), a directed and vertex-weighted 
graph, called the horizontal-constraint graph G_H(V,E), is constructed as follows. 

V : source s, sink t, and m vertices labeled with module names 

E : (s,x) and (x,t) for each module x, and (x,x') iff x and x' are in the constraint 
“x must be left-of x'”. 

Vertex-weight : zero for s and t, width of module x for the other vertices 

Similarly, the vertical-constraint graph G_V(V,E) is constructed using the “below” 
constraint and the height of each module. 

Neither graph contains a directed cycle. Furthermore, for every pair 
of modules, there is always an edge in G_H or in G_V, and not in both. This is 
because the order relation of any two modules in a sequence-pair uniquely de- 
fines either a horizontal or a vertical relation between them. From this fact, 
the X-coordinate and the Y-coordinate of the lower left corner of each module 
can be determined independently to satisfy the constraints in each direction, and 
the resultant placement is guaranteed not to contain any overlap. Then, the X 
and Y coordinates of each module are minimized by assigning 
the longest path lengths between the source and the node of the module in G_H 
and G_V, respectively. Similarly, the width and the height of the chip are deter- 
mined as the longest path lengths between the source and the sink in G_H and G_V, 
respectively. Since the width and the height of the chip are independently min- 
imum, the resultant packing is (Γ+,Γ−)-optimal. The longest path calculation 
can be done in O(m^2) time, proportional to the number of edges in the graph. 
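
The longest-path evaluation described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the function name `optimal_packing` is hypothetical, the constraint graphs are kept implicit (an edge is tested on the fly), and the module dimensions are invented. It relies on the observation that Γ− is a topological order of both graphs, since every "left-of" and every "below" edge goes from a module earlier in Γ− to a later one.

```python
def optimal_packing(gamma_pos, gamma_neg, sizes):
    """Return (placement, chip_width, chip_height) for a sequence-pair.

    sizes maps module name -> (width, height); placement maps
    module name -> (x, y) of the lower-left corner.
    """
    pos = {m: i for i, m in enumerate(gamma_pos)}
    neg = {m: i for i, m in enumerate(gamma_neg)}

    def left_of(a, b):   # edge a -> b in the horizontal-constraint graph
        return pos[a] < pos[b] and neg[a] < neg[b]

    def below(a, b):     # edge a -> b in the vertical-constraint graph
        return pos[a] > pos[b] and neg[a] < neg[b]

    x, y = {}, {}
    # Visit modules in Gamma- order so all predecessors are already placed.
    for i, b in enumerate(gamma_neg):
        preds = gamma_neg[:i]
        x[b] = max((x[a] + sizes[a][0] for a in preds if left_of(a, b)), default=0)
        y[b] = max((y[a] + sizes[a][1] for a in preds if below(a, b)), default=0)
    width = max(x[m] + sizes[m][0] for m in gamma_neg)
    height = max(y[m] + sizes[m][1] for m in gamma_neg)
    return {m: (x[m], y[m]) for m in gamma_neg}, width, height


# Assumed module dimensions, chosen only for illustration.
sizes = {"a": (4, 2), "b": (3, 3), "c": (2, 5), "d": (1, 1), "e": (2, 2), "f": (5, 1)}
placement, width, height = optimal_packing("abdecf", "cbfade", sizes)
print(width, height)  # 8 10
```

Each module is compared with every earlier module once, so the loop performs O(m^2) work, in line with the bound stated in the text.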

As an example, G_H and G_V are shown in Fig.7 for (Γ+,Γ−) = (abdecf, 
cbfade). The resultant placement after longest path length calculation is shown 
in Fig.8. 




Figure 7. Constraint graphs G_H (left) and G_V (right) (transitive edges are not drawn for sim- 
plicity). 



3.3 The P-admissible Solution Space 



Figure 8. A (Γ+,Γ−)-optimal packing for (Γ+,Γ−) = (abdecf, cbfade). 

Previous discussions conclude: 

Theorem 4 The set of all sequence-pairs is a P-admissible solution space of 
RP. More precisely, it consists of (m!)^2 sequence-pairs, each of which can be 
mapped to one specific (Γ+,Γ−)-optimal packing in O(m^2) time. And at least 
one of these packings is an optimal solution of RP. 



Our discussion started from minimizing the area of the chip. However, none of 
the theorems except Theorem 4 depends on the evaluation, and 
Theorem 4 holds for any evaluation function as long as it is independently non- 
decreasing in both the width and the height of the chip. Therefore we may minimize 
instead, for example, the perimeter of the chip, the area of a chip with a pre-specified 
aspect ratio, or the height of the chip when its width is fixed. This fact 
expands the applicability of our solution space. 

It has been assumed that the orientation of each module (vertically laid or hor- 
izontally laid) is fixed. When the orientation is also to be optimized, 
we hold a {0, 1} sequence of length m, expressing whether each mod- 
ule is horizontal or vertical. The solution space is enlarged to the size of 
(m!)^2 2^m. (Even the orientation optimization for a fixed floorplan is known 
to be NP-hard [10].) This technique can easily be extended to so-called “soft” 
modules, by preparing three or more candidate (width, height) pairs for one 
module [11]. 

4. Experiments 

To show the usefulness of the proposed solution space, we first show its po- 
tential in packing. Then, an application example in VLSI layout will be given. 





4.1 Rectangle Packing 

We extracted the dimensions of 146 modules from a printed circuit board 
in an industry example. These modules are packed by a simulated annealing 
method by the move (transformation, perturbation) of a solution based on three 
operations of pair-interchanges of: (i) two module names in Γ+, (ii) two mod- 
ule names both in Γ+ and Γ−, and (iii) the width and the height of a module, 
where the last one is for orientation optimization. The initial sequence-pair is 
given by assuming Γ+ = Γ−, which corresponds to a one-horizontal-row pack- 
ing. The temperature is decreased following a quite standard annealing schedule, 
but, from a heuristic point of view, operation (i) is selected with higher probabil- 
ity at higher temperatures, and operation (iii) is selected with higher probability 
at lower temperatures. 

The result is shown in Fig.9. Computation on a Sun SparcStation II stopped after 
29.9 minutes on reaching the terminating temperature. The algorithm searched 
no more than 606,192 distinct sequence-pairs out of the solution space of size 
(146!)^2 2^146 ≈ 1.23 × 10^552. Notice that searching only a fraction of about 
4.92 × 10^-547 of the solution space was enough to obtain Fig.9. As a challenge, 
we tried 500 pieces, using 18.83 hours to get the result shown in Fig.10. 
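
The size of the solution space and the searched fraction can be checked with exact integer arithmetic. The short Python sketch below does so; the helper name `scientific` is ours, and the count of 606,192 is the upper bound on visited sequence-pairs quoted in the text.

```python
import math

m = 146
visited = 606_192
# (m!)^2 sequence-pairs times 2^m orientation choices.
space = math.factorial(m) ** 2 * 2 ** m

def scientific(log10_value):
    """Split a base-10 logarithm into (mantissa, exponent)."""
    exponent = math.floor(log10_value)
    return 10 ** (log10_value - exponent), exponent

size_mantissa, size_exp = scientific(math.log10(space))
frac_mantissa, frac_exp = scientific(math.log10(visited) - math.log10(space))
print(f"space    ~ {size_mantissa:.2f}e{size_exp}")   # space    ~ 1.23e552
print(f"fraction ~ {frac_mantissa:.2f}e{frac_exp}")   # fraction ~ 4.92e-547
```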




Figure 9. Packing of 146 modules. 




Figure 10. Packing of 500 modules. 



4.2 Module Placement With Wire Consideration 

For VLSI placement problems, we extend the evaluation to consider wires. 
Among various possible evaluations of wires, here we focus on the final chip 
area. In the following, we demonstrate a method that minimizes the chip area 
including wiring spaces. 

Let (Γ+,Γ−) be a sequence-pair, Π be a (Γ+,Γ−)-optimal packing, and 
W and H be the width and height of Π. Terminals are given as fixed points on 
the boundaries of each module. A net is a set of terminals (multi-terminal net), 
which must be connected by wires later, in the detailed routing phase. A set of nets 
is given as the netlist N. For net i, the width and the height of the smallest 
bounding box of the terminals are denoted W_i and H_i, respectively. T is the 
sum of the wire width and the space between wires. We use the following formulae 
to estimate the final chip width W' and height H', which are the ones proposed 
in [9]. 



\[ W' = W + T \, \frac{\sum_{i \in N} H_i}{H}, \qquad H' = H + T \, \frac{\sum_{i \in N} W_i}{W} \]



The second term in the right-hand side of each formula estimates the amount 
of increase in one direction owing to the wires, assuming all wires are uni- 
formly distributed in the final chip. They experimentally showed that the resul- 
tant placement is acceptable for the later detailed routing stage. 
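
As a rough illustration of this kind of estimate, the sketch below computes the bounding box of each net and spreads the total wiring area uniformly over the chip. The formulae are not fully legible in this copy, so the form used here, W' = W + T·ΣH_i/H and H' = H + T·ΣW_i/W, is an assumption inferred from the surrounding description; the function name and the sample data are hypothetical.

```python
def estimate_chip_size(W, H, T, nets):
    """Estimate chip width and height after routing.

    W, H: width and height of the packing; T: wire pitch (width plus spacing).
    nets: list of nets, each a list of (x, y) terminal positions.
    Assumed form: W' = W + T * sum(H_i) / H and H' = H + T * sum(W_i) / W,
    where W_i and H_i are the bounding-box width and height of net i.
    """
    sum_w = sum(max(x for x, _ in net) - min(x for x, _ in net) for net in nets)
    sum_h = sum(max(y for _, y in net) - min(y for _, y in net) for net in nets)
    return W + T * sum_h / H, H + T * sum_w / W


# Two made-up nets on a 10 x 5 packing with wire pitch 0.5.
nets = [[(0, 0), (4, 2)], [(1, 1), (1, 4), (6, 3)]]
w_est, h_est = estimate_chip_size(10, 5, 0.5, nets)
print(w_est, h_est)  # 10.5 5.45
```

The reasoning behind the assumed form is dimensional: vertical wire segments of total length ΣH_i and pitch T occupy an area T·ΣH_i, which widens a chip of height H by T·ΣH_i/H, and symmetrically for the height.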

As for the variety in self-symmetric placement of each module, there are 
eight choices in total per module, which are the combinations of four choices 
of {0, 90, 180, 270} degree rotations and two choices of {yes, no} for the 
reflection about the Y axis. In our system, this code for orientation and a sequence- 
pair are put together into a simulated annealing process, which runs in a similar 
fashion to the rectangle packing optimization. The process searches a solution 
space of size (m!)^2 8^m. 

The point which is not mentioned in [9] is how the location of each individual 
module is calculated. After the best evaluated code is obtained, the coordinates of 
each module are determined as follows. Assume (X_j,Y_j) are the coordinates of the 
lower left corner of module j in Π, which is the information we can use in this 
phase. Let N_{X_j} be the set of nets such that the X coordinate of the left side of the bounding 
box of the net is less than or equal to X_j. Similarly, N_{Y_j} is defined using Y_j. We 
determine the coordinates (X'_j,Y'_j) of the lower left corner of module j in the 
resultant chip by the following formulae. 



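\[ X'_j = X_j + T \, \frac{\sum_{i \in N_{X_j}} H_i}{H}, \qquad Y'_j = Y_j + T \, \frac{\sum_{i \in N_{Y_j}} W_i}{W} \]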



The biggest building block layout data, called “ami49”, is taken from the 
MCNC benchmarks, and placed by the above described method with an additional 
constraint: aspect ratio = 1. To the authors’ knowledge, it has not been handled with- 
out hierarchically dividing the problem to reduce the problem size. The result is 
shown in Fig. 11. Computation time was 31.36 minutes on a Sun IPX. 




Figure 11. Placement of MCNC “ami49”. 



5. Concluding Remarks 

The motivation of this work was our experience that many VLSI designers 
are not satisfied with slicing structures. This paper has achieved a breakthrough 
by introducing a P-admissible solution space to the rectangle packing problem, 
which is fundamental to the layout design. 

Experiments suggest that hundreds of rectangles can be packed very effec- 
tively in reasonable time. The biggest MCNC benchmark data, ami49, is now 
tractable without hierarchically dividing the problem, by utilizing the proposed 
solution space. 

As the experiments revealed, the search may cover only a fraction of the whole 
solution space. Though the results are of a quality high enough for practical 
use, the space is too vast. An interesting open question from a theoretical point 
of view is whether the size of the space can be reduced. However, we conjecture 
that the size of the space cannot be decreased drastically. 
1. Timing 

1.1 Introduction 

Timing analysis is concerned with estimating and optimizing the performance 
of integrated circuits. It encompasses a wide range of activities including phys- 
ical modeling of transistors and interconnect wires, derivation of analytical and 
empirical gate and wire delay models, accurate estimation of long- and short- 
path delays through combinational logic, detection of setup and hold violations 
in sequential circuits, as well as a variety of combinational and sequential cir- 
cuit transformations aimed at maximizing operation speed. The three papers 
reviewed here represent particularly significant contributions to the field of tim- 
ing analysis and optimization: Brand and Iyengar [4] built the foundation for the 
field of false-path analysis; Szymanski and Shenoy [33] had the last word on the 
field of timing verification of latch-based circuits; and Shenoy and Rudell [29] 
are credited with making retiming viable for industrial-sized circuits. 

1.2 Functional Timing Analysis 

The origin of modern timing analysis algorithms can be traced back to the 
work of Kirkpatrick and Clark [17] who showed how the PERT (Project Eval- 
uation and Review Technique) method of Operations Research can be adapted 
to compute the delay and identify the critical paths of combinational logic cir- 
cuits. While the approach was fundamentally premised on ignoring the func- 
tionality of a circuit’s logic gates, it is interesting to note that Kirkpatrick and 
Clark recognized the need to account for some functionality (different rise and 
fall delays, and distinction between AND and OR gates) in order to reduce the 
inherent pessimism in a function-less analysis. Despite the appeal of this ap- 
proach, its realization in timing analysis programs did not come quickly. One of 
the earliest reported implementations of these ideas is that of Hitchcock, Smith, 
and Cheng [11]; they developed the Timing Analysis program and used it to 
detect timing bugs in the Processor Unit of the IBM 3081. 
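
A PERT-style, function-independent delay computation amounts to a longest-path traversal of the gate network in topological order. The sketch below is a generic illustration of that idea, not the algorithm of [17] or the Timing Analysis program of [11]; the circuit, the gate delays, and the function name `critical_path` are all made up, and a single delay per gate is assumed (no separate rise and fall delays).

```python
from graphlib import TopologicalSorter

def critical_path(fanin, delay):
    """Compute latest arrival times and one longest (critical) path.

    fanin: gate -> list of gates or inputs driving it.
    delay: gate -> propagation delay (primary inputs default to 0).
    """
    arrival, best_pred = {}, {}
    for gate in TopologicalSorter(fanin).static_order():
        preds = fanin.get(gate, [])
        # The latest-arriving fan-in determines when this gate can switch.
        best_pred[gate] = max(preds, key=arrival.__getitem__, default=None)
        start = arrival[best_pred[gate]] if preds else 0
        arrival[gate] = start + delay.get(gate, 0)
    node = max(arrival, key=arrival.get)
    path = []
    while node is not None:          # walk back along the latest fan-ins
        path.append(node)
        node = best_pred[node]
    return arrival, path[::-1]


fanin = {"g1": ["a", "b"], "g2": ["b", "c"], "g3": ["g1", "g2"]}
delay = {"g1": 2, "g2": 3, "g3": 1}
arrival, path = critical_path(fanin, delay)
print(arrival["g3"], path)  # 4 ['b', 'g2', 'g3']
```

Because the traversal never asks whether a path can actually be sensitized by some input vector, the reported delay is an upper bound, which is the pessimism that functional timing analysis sets out to reduce.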

Brand and Iyengar are credited with being the first to demonstrate the need 
for a fuller accounting of functional relationships during “block-oriented” tim- 
ing analysis. Their succinct 1986 ICCAD paper [4] (and its later journal version 
[5]) laid the foundation for the field of “false path analysis” and led, over the 
following decade and a half, to a proliferation of publications by many authors 
on various aspects of this topic [10, 21, 27, 2, 9, 7, 6, 30, 31, 36]. The impact 
that their contribution had can be attributed to two factors. First, they proposed 
an efficient procedure, based on their earlier work on logic synthesis [3], for 
collecting and checking for consistency the conditions for signal propagation 
along a circuit’s paths. This showed that such a functional analysis, which the- 
oretically increases the computational effort from linear to exponential, is still 
quite feasible in many cases. And second, they empirically demonstrated the 
existence of false paths (which they dubbed “non-functional” paths) in a vari- 
ety of benchmark circuits; furthermore, they showed that ignoring such paths in 
timing analysis can lead to significant over-estimations of circuit delay. 

1.3 Modeling and Verification of Clock Schedules 

Ensuring proper operation of synchronous sequential circuits amounts to 
checking that the applied clock schedule does not lead to hold violations (due to 
short signal paths) or setup violations (due to long signal paths) at the circuit’s 
state devices. For circuits that employ edge-triggered flip-flops, such checks are 
localized to the combinational logic between the flip-flops and can be carried 
out quite efficiently. The checks become significantly more involved, however, 
for circuits that use level-sensitive latches because such latches allow signals to 
“flow through” when the latches are enabled. The search for correct and efficient 
timing verification procedures of latch-based circuits began in the early eight- 
ies when level-sensitive latches started to replace the larger and slower edge- 
triggered flip-flops as the state device of choice in integrated circuits. 

While the “latch problem” was recognized in the early timing verifiers of 
McWilliams [22] and Agrawal [1], it was Jouppi [15, 16] who pointed out that 
the flow-through property of latches allowed a latch-based circuit to be operated 
at a higher clock frequency than that predicted by the maximum combinational 
delay between latches. He referred to this as “time borrowing” and presented a 
slack propagation procedure to implement it. At about the same time, Ouster- 
hout [23] observed that ad-hoc solutions for the timing verification problem of 
latch-based circuits that employ multi-phase clocking could result in undetected 
timing errors and suggested that it is “the single greatest problem yet to be solved 
in timing verification.” From the mid-to-late eighties several authors proposed 
models and algorithms to solve this problem, most notably Unger and Tan [35], 
Szymanski [32], and Dagenais [8]. 

The first comprehensive mathematical model for latch-based multi-phase syn- 
chronous circuits was proposed by Sakallah, Mudge and Olukotun [28]. They 
introduced the notion of a “phase shift” operator that resulted in a significant 
simplification of the signal propagation equations between latches controlled by 
different clock phases. They also proposed and implemented algorithms for tim- 
ing verification (checkTc) and clock schedule optimization (minTc). The signif- 
icance of this development was quickly recognized by Szymanski and Shenoy 
who dubbed this the SMO model and proceeded to propose more efficient imple- 
mentations of the minTc [34] and checkTc [33] algorithms. In particular, their 
paper on verifying clock schedules [33] provided a thorough theoretical analy- 
sis of the SMO model and highlighted several subtleties (such as solution non- 
uniqueness and the need for careful initialization) in the checkTc verification al- 
gorithm that could lead to incorrect analysis results or to non-convergence. They 
also proved that the (correctly-implemented) verification algorithm can check an 
n-latch circuit in time in the worst case; practically, the algorithm runs in almost 
linear time for typical circuits with sparse connections between latches. 

1.4 Efficient Retiming 

Retiming transformations were first proposed by Leiserson and Saxe in 
1983 [19]. Using a simple model of sequential logic circuits, they presented 
several polynomial-time algorithms that can transform an initial circuit to an 
equivalent faster circuit. This was achieved by a sequence of register moves 
across the combinational logic that sought to balance delays between register 
stages. Many additional enhancements and extensions to this basic algorithm 
were proposed since its initial introduction. These include, among many oth- 
ers, retiming edge-triggered circuits under realistic delay models [18], retiming 
while accounting for hold constraints [26], and retiming of level-clocked cir- 
cuits [20, 24, 13, 14]. 

The paper by Shenoy and Rudell [29] observed that polynomial-time algo- 
rithms are too expensive in practice and suggested several algorithmic enhance- 
ments that enabled retiming to be viable for industrial-sized circuits. Specifi- 
cally, they proposed an early-termination condition for the procedure that checks 
if a given clock period is a feasible solution to a given retiming of the circuit. 
This condition is based on an efficient scheme for identifying negative-weight 
cycles in a graph (similar to that used by Szymanski for computing optimal clock 
schedules [34]), and typically leads to a drastic reduction in runtime. Shenoy and 
Rudell also addressed the scalability of retiming by proposing and implementing 
an algorithm for minimum-area retiming whose worst-case complexity is worse 
than Leiserson and Saxe’s original algorithm, but whose memory complexity is 
linear instead of quadratic. 

2. Test 

2.1 Introduction 

Integrated circuit test is concerned with the problem of verifying that a man- 
ufactured integrated circuit meets its specifications and will operate reliably in 
a system. The most common specifications that are tested are function and per- 
formance over the operating voltage and temperature range. Two major long- 
term thrusts in test research are making chip designs easier to test, and relying 
increasingly on automatically-generated structural tests to screen for function, 
performance and reliability. Since chip complexity is increasing faster than pin 
count, design-for-test (DFT) hardware has had to be placed on chip to provide 
the necessary controllability and observability to achieve high test coverage of 
manufacturing defects. At the same time, falling transistor cost and rising per- 
formance have made it necessary and economically attractive to perform more of 
the testing with on-chip hardware, such as built-in self-test (BIST) and embed- 
ded deterministic test. The general problem is now one of test resource partition- 
ing, that is, dividing up the test resources between the chip and the automatic 
test equipment (ATE), and test synthesis, that is, automatically synthesizing the 
on-chip test hardware. 

As chip complexity has increased, it has become too expensive to rely solely 
on manually written functional tests, particularly for ASICs. Increasingly, au- 
tomatic test pattern generation (ATPG) is performed for digital circuits, using 
knowledge of circuit structure and the defects likely to occur in manufacturing. 
The many possible defect behaviors and locations are abstracted to logical fault 
models. The increasing performance of integrated circuits has shifted the focus 
of structural test for digital circuits to delay test, that is, testing for defects that 
cause a circuit to be too slow. 

These two long-term test research thrusts come together in the paper selected 
for the Test section. It is commonly the case that ATPG cannot achieve 100% 
coverage of all the targeted circuit faults. In their ICCAD88 paper [37] and their 
follow-on journal article [38], Kundu, Reddy and Jha showed that for CMOS 
combinational logic circuits, such limitations are due to the circuit structure, 
rather than the logic function. They described an algorithm to synthesize com- 
binational logic functions so as to be robustly testable for multiple occurrences 
of stuck-at, stuck-open and path delay fault models. This paper sparked sev- 
eral years of research on synthesis for testability, much of which appeared at 
ICCAD. 



2.2 Synthesis for Testability 

There is a large body of prior research on synthesizing circuits to increase 
their testability [39]. Much of this research has focused on enhancing stuck-at 
fault testability in combinational and sequential logic blocks, much of which 
appeared at ICCAD [40, 41, 42, 43]. There has also been extensive research 
on synthesis for path delay testability [44, 45, 46] and random pattern testa- 
bility [47], with some important papers on these topics at ICCAD [48, 49]. If 
testability is not considered during timing optimization, it can make path delay 
test more difficult [50]. The synthesis method of Kundu, Reddy and Jha [37] 
produced circuits that are testable under multiple faults for both stuck and path 
delay fault models. The primary challenge of their approach is the restriction 
to unate circuits. For many functions, forming unate circuits results in an unac- 
ceptable increase in circuit size. The ICCAD papers that followed were able to 
achieve testability without this restriction. In today’s large scale designs, testa- 
bility synthesis research has shifted to focus on high-level design descriptions 
[51, 52, 53, 54], including work appearing at ICCAD [55, 56]. 

Industrial experience on large designs is that achieving high stuck-at fault 
coverage is a struggle, even with extensive DFT features. This is due to em- 
bedded arrays, buses, and structures that cannot have test circuitry included, 
and the difficulties they cause for ATPG. Most current delay test synthesis re- 
search is focused on improving test access via scan chains [58, 59], building 
upon the pioneering work at IBM [57], and applying the two-pattern tests via 
low-overhead test logic and constrained tester resources [60]. System-on-a-chip 
(SoC) designs present additional challenges and standardization requirements 
in providing test access to on-chip modules which internally have their own test 
methods [61, 62, 63, 64], but little of this has appeared at ICCAD. 

2.3 Automatic Test Pattern Generation 

The primary industrial impact of test research has been the development of 
automatic test pattern generation (ATPG) tools. The primary enabler was 
the wide adoption of scan chains. Most ATPG research appeared in other fo- 
rums, but ICCAD contributed with pioneering research on targeting realistic 
faults [65], improved search algorithms [66], and IDDQ testing [67]. Even “full” 
scan circuits contain sequential logic, and ICCAD served as a forum for some 
early sequential circuit ATPG research [68, 69]. 

As circuit performance increases, delay fault ATPG becomes increasingly 
important. There are two widely used delay fault models: transition and path 
delay. The transition fault model assumes that a slow transition on a single line 
is so large that any path through it will be slow. Smaller delay defects can be 
detected by testing the longest paths through each gate, with an early paper on 
this topic at ICCAD [71]. The path delay fault model assumes that a path may 
be slow, and targets distributed delay due to manufacturing process variations. 
Robust path delay tests are valid even in the presence of arbitrary circuit delays, 
while nonrobust tests assume a single slow path. Combinational CMOS circuits 
synthesized with Kundu, Reddy and Jha’s technique [37] are robustly testable. 
A large number of path delay fault ATPG tools have been developed [72, 73, 74], 
with many important papers appearing at ICCAD [75, 76, 77, 78]. Transition 
fault tests are widely used since they can be generated with a simple modification 
to a stuck-at ATPG. The traditional fault models are being extended to more 
directly target defects, termed defect-based test. These include delay variation 
due to capacitive effects and noise [79, 80], resistive bridges and opens [81, 82], 
or their combination [83, 84]. In addition, it has been shown that since a 
fault model cannot include all possible defect behaviors, non-target tests may 
be better for achieving high quality [85]. Only a small amount of defect-based 
testing research has appeared at ICCAD [86]. 

3. Manufacturing 

3.1 Introduction 

Manufacturing research is usually associated with silicon technology devel- 
opment and is the major focus of such conferences as the IEDM. As the interface 
between the design and manufacturing of integrated circuits grew more compli- 
cated, however, there was a need for CAD tools and CAD developers to get in- 
volved in order to solve some of the pressing issues that were causing problems. 
Thus a small group of people, by the standards of the overall CAD community, 
crusaded for improved design/technology coupling through automation. This 
section chronicles this segment of CAD, as relating to ICCAD, over the last 
twenty years. 

3.2 Spice and Beyond 

The wide availability and popularity of Spice [87] made it the de-facto stan- 
dard interface between design and manufacturing. Much of the early work fo- 
cused on methods that would either generate the statistically varying Spice pa- 
rameters (e.g. [88, 90]), or methods that would summarize them in a manner 
conducive to reducing the number of simulations required, e.g. the paper selected 
for this book [89], the follow-on journal article [92], and the book chapter [91]. 
These early works set the stage for the current pervasive industrial practice of 
creating corner cases to model the inherent variability in the IC manufacturing 
process. 

Throughout the eighties and early nineties, a number of other important pa- 
pers appeared in ICCAD focusing on variability modeling and on circuit opti- 
mization to improve yield, e.g. [95, 97, 98, 99]. The second selected paper in 
this area, [101], represents the formal integration of worst case analysis and cir- 
cuit yield maximization. As technology scaling continued and circuits became 
too large to analyze in detail using Spice, other works appeared, such as [102] 
where an early attempt at performing statistical timing was presented, and [103] 
where an early attempt at dealing with within-die variations was proposed. 

3.3 Defects and Yield 

Another lively area of early manufacturing research in ICCAD was that of 
modeling catastrophic defects (shorts, opens and other layout deformations). From 
the early work in [93, 94] to the seminal work [96] and the reporting of one of the 
early CAD tools to deal with this problem [100], ICCAD provided an early fo- 
rum for the design/manufacturing interface in this practically important area. 

In spite of the excellent tutorial [104] in 1996, little has happened in ICCAD 
since the late nineties. Other conferences, particularly ISQED, have become the 
target of papers that focus on the design/manufacturing interface. Nevertheless, 
much of the work originally reported in ICCAD remains in active use today. 
Abstract 

This paper presents a formal approach to the worst-case design of Integrated Circuits, yielding 
realistic estimates of variations in device performances. The worst-case analysis is performed in 
terms of statistically independent process disturbances and employs the statistical process simu- 
lator, FABRICS II. 



1. Proposed Methodology 

In order to achieve satisfactory manufacturing yield, the design of an IC is 
usually verified under some worst-case conditions. This process, called worst- 
case design, has been traditionally performed in terms of the electrical param- 
eters of the IC devices, such as threshold voltages and transconductances for 
MOSFET’s. In reality, however, these parameters are statistically dependent 
(correlated) random variables with a multilevel structure of variance (intra-die 
and inter-die). In traditional approaches to worst-case design, the correlation 
coefficients between device parameters are not taken into account, and therefore 
IC performances are estimated for some unrealistic combinations of the device 
parameters. Hence the results of such an analysis are usually too pessimistic. 

In this paper we propose a more rigid approach to worst-case design, which 
yields realistic estimates to variations in device performances. This approach is 
based upon the availability of the statistical process simulator, FABRICS II [2] 
which contains a sequence of process and device models. FABRICS II generates 
samples of device parameters for a set of process and layout parameters, as 
well as process disturbances which model random fluctuations inherent in the IC 
fabrication process (e.g. diffusivity of impurity atoms, or linewidth variations 
and misalignments in the lithography). 

The simulator can be tuned to a particular IC manufacturing process (i.e. the 
joint probability density function of device parameters estimated from simulated 
data is in good agreement with measured results) by finding the probability dis- 
tribution functions of the process disturbances. Since the process disturbances 
are chosen in FABRICS in such a way that they are statistically independent [1], 
we propose using these parameters in the worst-case analysis. The exact worst- 
case analysis has to be carried out for each IC performance separately (e.g. for 
average power dissipated in the IC or inertial delay of a signal). 

In the approach proposed, the selection of the significant worst case parame- 
ters is performed based upon the sensitivities of the performances to process dis- 
turbances. The sensitivities are estimated by perturbations using data obtained 
from FABRICS II, tuned to a particular IC fabrication process, coupled with 
a circuit simulator. Since sensitivities are local estimates of this dependence, 
a more accurate method of approximating the relationship over a wide range of 
changes in process disturbances is to build non-linear regression models relating 
performances to the process disturbances. Such models have been successfully 
built [3] and it was found that these dependencies are monotonic over a wide 
range of process disturbances. Therefore, sensitivities can be reliably estimated 
by large perturbations and the worst-case combinations of significant process 
disturbances can be obtained for each IC performance. Then realistic worst case 
sets of device parameters may be generated by FABRICS II if the significant 
process disturbances are changed by, for example, one standard deviation from 
their identified values in the “worst-case direction”. Due to independence of the 
process disturbances, it is possible to estimate the probability of occurrence of 
this case, which would be valuable information for IC designers. Furthermore, 
if the device models are defined in such a way as to be independent of device 
dimensions, then the designer may alter IC layout to improve performance. If, 
however, some changes in the fabrication process parameters are necessary, the 
worst-case device models have to be evaluated once again. Observe that such 
an evaluation is computationally inexpensive because in the proposed approach 
Monte Carlo simulations are not required. Moreover, the process disturbances 
are chosen to be independent of process parameters and layout dimensions, so 
FABRICS II does not have to be re-tuned. 

Theoretically, worst-case device models have to be evaluated for each IC 
under consideration. However, we can obtain these models for typical perfor- 
mances of the basic cells of VLSI circuits (e.g. power and speed of an inverter) 
and use these models for the approximate worst-case analysis of large IC’s. Note 
that due to the multilevel structure of the process disturbances implemented in 
FABRICS II, the worst-case design can be performed for both intra-die and 
inter-die fluctuations. 

To illustrate the methodology proposed in this paper we present an example 
of a worst-case analysis for an exclusive-OR gate with two inputs. This gate is
implemented in a 3 µm NMOS process and consists of 7 MOS transistors. The
gate was loaded with a capacitive load equivalent to a fanout of 4. The worst-case
analysis was performed for the following performances:

o P - power dissipated
o Td - delay from input to output signals
o Tr - rise time of the output signal
o Tf - fall time of the output signal

The sensitivity analysis was performed based on data obtained from FABRICS II
tuned to this NMOS fabrication process and SPICE. A number of simulations
were performed: for the nominal design (with the disturbances equal
to their mean values) and for two perturbations of each process disturbance.
The sensitivities of the circuit performances with respect to each process
disturbance were calculated and normalized by the corresponding nominal values.
The performances under consideration were most sensitive to the following
process disturbances:

o Ln - linewidth variation in nitride lithography
o Lp - linewidth variation in polysilicon lithography
o DB - Boron diffusivity
o DAs - Arsenic diffusivity
o Rox - parabolic dry oxide growth rate

The normalized sensitivities of the circuit performances to these process dis- 
turbances are given in Table 1. 



performance        Ln        Lp        DB       DAs       Rox
P              -0.199     0.137    -0.025    -0.064    -0.211
Tr              0.158    -0.282    -0.001    -0.188    -0.028
[illegible]     0.096    -0.466     0.089    -0.170     0.603
[illegible]     0.127    -0.352     0.459    -0.304     0.247

Table 1. Sensitivity of performance to process disturbances. 

The worst-case combination of the process disturbances for each circuit
performance was determined based upon the signs of the sensitivities. We decided
to verify the performance of the circuit under consideration in the case when
all the process disturbances are shifted by two standard deviations from their
respective mean values in the worst-case direction. As an example, the nominal
power dissipation of 0.58 mW was increased to 1.1 mW in the worst case for
power; similarly, the nominal inertial delay of 1.8 ns was increased to a
worst-case value of 3.56 ns.
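
The sign-based selection described above can be sketched in a few lines of
Python. This is only an illustration and not part of FABRICS II: the function
names are hypothetical, the sensitivities are the power (P) row of Table 1, and
the probability estimate additionally assumes that each disturbance is
Gaussian, which the text does not state.

```python
import math

# Normalized sensitivities of the power P (row P of Table 1).
SENS_P = {"Ln": -0.199, "Lp": 0.137, "DB": -0.025, "DAs": -0.064, "Rox": -0.211}


def worst_case_shifts(sens, n_sigma=2.0, increase_is_worse=True):
    """Shift each disturbance by n_sigma standard deviations in the
    direction that degrades the performance (hypothetical helper)."""
    shifts = {}
    for name, s in sens.items():
        direction = 1.0 if s > 0 else -1.0
        if not increase_is_worse:
            direction = -direction
        shifts[name] = direction * n_sigma
    return shifts


def one_sided_tail(n_sigma):
    """P(Z >= n_sigma) for a standard normal Z (Gaussian assumption)."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


def joint_probability(shifts):
    """Probability that every independent disturbance is at least as
    extreme as its shift, taken in the worst-case direction."""
    p = 1.0
    for shift in shifts.values():
        p *= one_sided_tail(abs(shift))
    return p


shifts = worst_case_shifts(SENS_P)
print(shifts)                     # sign pattern of the worst case for power
print(joint_probability(shifts))  # product of five independent tail terms
```

Because the disturbances are independent, the joint probability is a plain
product of per-disturbance terms; with correlated device parameters no such
factorization would be available.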

The most significant device model parameters corresponding to these cases 
are shown in Table 2.

            VthD     KPD        VthE    KPE        Tox    [illegible]
            (V)      (A/V^2)    (V)     (A/V^2)    (Å)    (µm)
nominal     -3.6     3.32E-5    1.18    4.16E-5    817    0.403
WC1-P       -3.75    3.55E-5    1.04    4.71E-5    739    0.436
WC2-Td      -3.55    3.18E-5    1.27    3.87E-5    870    0.358

Table 2. Nominal and worst case device performance. 



2. Summary 

In closing, we presented a methodology for a formal analysis of the worst-case
performance of ICs that gives accurate results and is computationally
efficient. Furthermore, the same methodology may be used to generate
generic worst-case device parameters independent of circuit layout.



Abstract

The usual block oriented timing analysis for logic circuits does not take into account functional 
relations between signals. If we take functional relations into consideration we may find that 
a long path is never activated. This observation can be used to calculate improved and more 
accurate delays. It is not practical to consider the complete truth table with all the relationships 
between signals. We use a procedure that considers only a subset of the relationships between 
signals and checks for non-functional paths. The delay calculation then takes into account the 
discovered non-functional paths to determine less pessimistic delays through the logic. 



1. Motivation 

Static timing analysis tools [1] use block oriented algorithms to compute the 
worst case delays in a combinational network. The arrival time of a signal, 
namely when the value of that signal is valid, is calculated assuming that infor- 
mation propagates over all the paths to that signal. 




Figure 1. Simple example of a logic circuit 

An example of a combinational network is shown in Figure 1. It contains 
four gates with their functions indicated. It contains seven signals A, B, C, D, 
E, X, Y and nine connections A, B, C1, C2, D, E, X1, X2, Y. (A signal is a
set of connections with a common source.) There are four paths to the
connection Y, namely A*X2*Y, B*D*E*X2*Y, C1*E*X2*Y, C2*Y. A path is an
ordered sequence of connections in the usual sense and we use * as a separator
or concatenation symbol.

In all our examples we will assume, for simplicity, that each logic block
(AND, OR, INV) has a delay of 1 unit to propagate a value from any of its 
inputs to any of its outputs. Let us also assume that the primary inputs A, B, C 
are available at time 0. A block oriented algorithm such as [1] would compute 
the arrival time at Y to be 4 units. We can see that this is too pessimistic by con- 
sidering the analysis in Table 1. The two cases in Table 1 correspond to values 
0 and 1 for the primary input C. Input A has value a, input B has value b. When 
C is 0 the output Y reaches its steady state (value a) after only 3 units of time. 
When C is 1 the output Y reaches its final value 1 in just 1 unit of time. In both 
cases, the signal Y does have a valid value at time 3. 

The block oriented algorithm did not consider the logical relationship
between the connections C1 and C2. The long path B*D*E*X2*Y that contributed
to the 4 unit arrival time at Y is always blocked since the AND gate and the OR
gate cannot get gating values simultaneously.
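
The gap between the two figures can be checked by brute force on a circuit this
small. The sketch below is not the algorithm of this paper; it simply
enumerates all input vectors. The netlist is inferred from the path
descriptions and from the tracing example in Section 2 (Y = OR(X, C),
X = OR(A, E), E = AND(D, C)); taking D = INV(B) is an assumption, since the
figure is not reproduced here, and all function names are hypothetical.

```python
from itertools import product

# Gate name -> (function, input signals). D = INV(B) is an assumption.
GATES = {
    "D": ("INV", ["B"]),
    "E": ("AND", ["D", "C"]),
    "X": ("OR", ["A", "E"]),
    "Y": ("OR", ["X", "C"]),
}
ORDER = ["D", "E", "X", "Y"]        # topological order
CONTROLLING = {"AND": 0, "OR": 1}   # input value that fixes the output alone
DELAY = 1                           # unit delay per block


def evaluate(fn, vals):
    if fn == "INV":
        return 1 - vals[0]
    if fn == "AND":
        return int(all(vals))
    return int(any(vals))


def block_oriented_arrival(output="Y"):
    """Latest arrival over all paths, ignoring gate functions."""
    t = {s: 0 for s in "ABC"}
    for g in ORDER:
        t[g] = max(t[i] for i in GATES[g][1]) + DELAY
    return t[output]


def settle_time(inputs, output="Y"):
    """Time at which `output` is final for one input vector: a gate settles
    as soon as its earliest controlling input has settled, otherwise after
    its latest input."""
    val = dict(inputs)
    t = {s: 0 for s in inputs}
    for g in ORDER:
        fn, ins = GATES[g]
        val[g] = evaluate(fn, [val[i] for i in ins])
        ctrl = [t[i] for i in ins
                if fn in CONTROLLING and val[i] == CONTROLLING[fn]]
        t[g] = (min(ctrl) if ctrl else max(t[i] for i in ins)) + DELAY
    return t[output]


def functional_arrival(output="Y"):
    """Worst settle time over all 2^3 input vectors."""
    return max(settle_time(dict(zip("ABC", bits)), output)
               for bits in product((0, 1), repeat=3))


print(block_oriented_arrival(), functional_arrival())
```

Exhaustive enumeration grows exponentially with the number of inputs, which is
why a practical method has to reason about a subset of the signal
relationships instead.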

It is not true in general that reduction in arrival time at a primary output im- 
plies non-testability of a connection in the sense of [2] (stuck-at faults) through 
that primary output. For example, the single output circuit in Figure 2 has no 
untestable stuck faults (stuck-at-zero and stuck-at-one) on its connections. How- 
ever, the path B*P*R2 is always blocked and taking this into account could im- 
prove the arrival time of the output O (for some combination of arrival times on 
the inputs). 



Primary inputs: A, B, C, E, F, G, G.



Figure 2. Example of a single output logic circuit with the path B*P*R2 always blocked but 
without any untestable stuck faults. 

Conversely it is not true in general that non-testability of a connection through
a primary output implies that the delay from that connection to the primary
output can be ignored. For example, in Figure 3, the connections A1 and B1
could be removed without changing the function of the output G. However, that 
does not imply that the arrival time at G can be reduced to 2; the input A=0, 
B=0, C=0 requires three stages to propagate to G. 

In our analysis we will only consider combinational logic networks. All the 
input combinations to these networks are assumed possible. For a synchronous 
sequential machine we will extract the combinational logic determining the out- 
puts and values latched in each cycle. It is possible that certain combinations of 
latch states are not achievable during normal operation. This could then also re- 
sult in various blocked paths. Our analysis considering only one cycle paths will 
not identify such blocked paths, unless the impossible combinations are given 
to us explicitly as don’t cares. 




Figure 3. Example of a logic circuit with some untestable faults. 



2. Algorithm 

We will briefly describe the algorithm here. More details and a proof of 
correctness can be found in [3]. It is based on tracing a number of paths from an 
output towards inputs and collecting conditions under which the path is blocked. 
If it obtains an inconsistent condition then the path is always blocked and its 
contribution to the output’s arrival time can be ignored. 

For example assume that we trace the longest path in Figure 1. We start at
the output Y and try to proceed through the connection X2. In order to avoid
blockage we must set C=0. Then from X we proceed through E, for which
we must set A=0. In order to go from E through D, we must set C=1, which
is inconsistent with our previous requirements and therefore we can ignore the
arrival time of D. Ignoring the arrival time of a signal is equivalent to assuming
that it is always available.

An important part of our timing analysis is a procedure that collects
conditions, derives more conclusions from them and checks for consistency. We use
the same procedure as in [4]. It makes the following deductions: 

if c = AND(a, b) and a=1, b=1 then c=1.
if c = AND(a, b) and c=0, b=1 then a=0.
if c = AND(a, b) and a=0 then c=0.
if c = AND(a, b) and c=1 then a=1, b=1.

It also makes all the other deductions obtained by adding gate inputs, permuting 
gate inputs, or changing the gate’s function. However, it will never split into 
cases; if it is given for the above AND gate that c=0 then it will not split into the 
two cases a=0 and b=0 to see if both lead to inconsistency. The failure to split into
cases means that the procedure may declare a set of conditions consistent, even 
though they are not. This may cause our delay calculation to be too conservative, 
but it will not cause it to be wrong. Completeness is sacrificed for performance 
reasons; we find that we have a good trade-off between efficiency and deductive 
power. 
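
The deduction rules for a single two-input AND gate can be sketched as below.
This is an illustrative reconstruction, not the procedure of [4]: the names
`deduce_and` and `Inconsistent` are hypothetical, unknown values are
represented by `None`, and the rules are applied to a fixed point without ever
splitting into cases.

```python
class Inconsistent(Exception):
    """Raised when a signal would need to be 0 and 1 at the same time."""


def deduce_and(a=None, b=None, c=None):
    """Apply the implication rules for c = AND(a, b) until nothing changes.
    Returns the (possibly extended) values of a, b and c."""
    v = {"a": a, "b": b, "c": c}

    def assign(name, value):
        if v[name] is None:
            v[name] = value
            return True          # learned something new
        if v[name] != value:
            raise Inconsistent(f"{name} must be both {v[name]} and {value}")
        return False

    changed = True
    while changed:
        changed = False
        if v["a"] == 1 and v["b"] == 1:
            changed |= assign("c", 1)
        if v["a"] == 0 or v["b"] == 0:
            changed |= assign("c", 0)
        if v["c"] == 1:
            changed |= assign("a", 1)
            changed |= assign("b", 1)
        if v["c"] == 0 and v["b"] == 1:
            changed |= assign("a", 0)
        if v["c"] == 0 and v["a"] == 1:
            changed |= assign("b", 0)
    return v["a"], v["b"], v["c"]


print(deduce_and(b=1, c=0))   # a is forced to 0
print(deduce_and(c=0))        # nothing is deduced: no case split on a, b
```

Given only c=0 the function returns without learning anything about a or b,
which is exactly the incompleteness described above: a hidden inconsistency
may go unnoticed, so a path is kept rather than wrongly discarded.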



3. Experimental results 

The method was implemented in PL/I as part of the logic synthesis system 
[5] and run on IBM 3081 and IBM 4381 computers. Two types of experiments 
were done. In the first one our method of doing timing analysis was applied to 
several pieces of logic in order to determine to what degree one can improve on 
the usual timing analysis. All the examples were implemented in the book set of 
LSI Inc. 5000 Series TFLH [6] and corresponding delay equations were used. 
We set our timing specifications to be not achievable, so that all outputs and all
paths were late. This had two consequences: maximum reduction in arrival
time, and larger CPU time than what one can expect with more realistic timing
requirements.

Table 2 shows for each example 

(i) size in terms of the number of gates and connections, 

(ii) the total number of outputs and number of outputs with reduced 
arrival time, 

(iii) CPU time for the standard timing analysis B, CPU time to com- 
pute the reduced arrival time R for all outputs as well as per 
output. 

Table 3 shows the amount of reduction for all the outputs. Each line
corresponds to one example; the last line is the total over all. The column labeled
“0” gives the number of outputs with no reduction. The column labeled “0 to 2”
gives the number of outputs whose reduction was more than 0, but no more than
2%. The last column gives the number of outputs with a reduction between 60%
and 62%. The percentage of reduction for an output O is calculated as follows:
Let b be the lower bound of B(O) and let r be the lower bound of R(O). Then
the percentage is 100*(b-r)/b.

In the second type of experiments we investigated the effect of the delay 
model on the amount of reduction achieved. The examples used for this were 
originally used as test cases for test generation programs [7]. The examples are 
described in terms of generic logic gates (ANDs, ORs, etc.). Exclusive ORs 
were expanded in our analysis. Two delay models were used for the analysis. 
The first was a unit delay model at the gate level and the second was a model 
where the delay through a gate is proportional to the logarithm of the number 
of inputs to the gate (one input gates have zero delay). The log delay model is 
applicable in estimating delay for a dynamic CMOS technology. 
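
A small sketch makes the difference between the two models concrete. The base
of the logarithm and the proportionality constant are not given in the text;
base 2 and a constant of 1 are assumptions here, and the two paths are
hypothetical.

```python
import math


def unit_delay(fan_in):
    """Unit delay model: every gate costs one unit."""
    return 1.0


def log_delay(fan_in, scale=1.0):
    """Log delay model: delay grows with log2 of the number of inputs,
    so a one-input gate has zero delay (base 2 is an assumption)."""
    return scale * math.log2(fan_in)


def path_delay(fan_ins, model):
    """Delay of a path given the fan-in of each gate along it."""
    return sum(model(k) for k in fan_ins)


wide = [8, 8]        # two 8-input gates
deep = [2, 2, 2]     # three 2-input gates
for model in (unit_delay, log_delay):
    print(model.__name__, path_delay(wide, model), path_delay(deep, model))
```

Under the unit model the deep path is the longer one, while under the log model
the wide path is, so the set of paths that matter for the analysis depends on
the delay model.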

Table 4 gives the characteristics of the examples in the second set. Table 5
summarizes the reduction achieved for these examples using both delay models.
Lastly, Table 6 gives the actual reduction of arrival times for outputs of those
examples that were reduced. It is clear that the analysis and the arrival time
reduction are sensitive to the actual delay model.

4. Conclusions 

The purpose of our study was to see whether one can reduce delay by consid- 
ering functional relations between signals. The answer is affirmative and Table 
3 shows the amount of reduction. 

Another question is what happens in large pieces of logic, which are of real 
interest. One can expect that a longer path is more likely reducible than a short 
one; our data are consistent with this expectation. Unfortunately we were unable 
to run larger examples; our implementation did not have performance as its main 
objective, and indeed proved too slow. 

In terms of CPU time our performance is much worse than the standard tim- 
ing analysis of [1]. This is likely to remain true for any algorithm that takes 
functionality into consideration. Therefore this type of analysis probably cannot 
be applied to the whole machine, only to some troublesome areas. 

We cannot conclude that our method would reduce the cycle time of a machine,
because critical paths may already be implemented with sufficient care, taking
functionality into consideration. However, we can expect savings in area and
power, which must sometimes be spent to speed up slow paths.

Table 1. Propagation of signal values in the circuit shown in Figure 1.

Table 2. Characteristics of one set of examples and run times.

             SIZE           Num OUTPUTS        CPU TIME on IBM 3081K (min:sec)
No.    Gates  Connects   Total  Reduced by R   For B   For R (total)   For R (per output)
1        331       606      23            12    0:06            0:28                 0:01
2        466       854      54             0    0:08            0:11                 0:01
3        526       975      69             0    0:04            1:19                 0:01
4        787      1353     106             0    0:16            0:26                 0:01
5       1030      2375     107             0    0:24            0:52                 0:01
6       1628      3008      79             8    0:11            0:43                 0:01
7       3049      6355     516            69    0:23           24:00                 0:03
8       4948      9914     602            73    0:35           80:22                 0:08

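Table 3. Reduction in arrival time for outputs in the first set of examples.

         Reduction in percent (column "a-b": more than a, no more than b)
No.      0  0-2  2-4  4-6  6-8 8-10 10-12 12-14 14-16 16-18 18-20 20-22 22-24 26-28 28-30 60-62
1       11    -    -    -    -    -     -     4     3     4     1     -     -     -     -     -
2       54    -    -    -    -    -     -     -     -     -     -     -     -     -     -     -
3       69    -    -    -    -    -     -     -     -     -     -     -     -     -     -     -
4      106    -    -    -    -    -     -     -     -     -     -     -     -     -     -     -
5      107    -    -    -    -    -     -     -     -     -     -     -     -     -     -     -
6       71    2    -    -    3    1     1     -     -     -     1     -     -     -     -     -
7      447    -   19   16    -    1     9    13     1     -     3     1     3     1     1     1
8      529    3    4    -   33   18     7     5     2     1     -     -     -     -     -     -
Tot.  1369    5   23   16   36   20    17    22     6     5     5     1     3     1     1     1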

Table 4. Characteristics for the second set of examples.

             SIZE
Name     Gates   Signals
C432       160       196
C499       202       243
C880       383       443
C1355      546       587
C1908      880       913
C2670     1193      1426
C3540     1669      1719
C5315     2307      2485
C6288     2416      2448
C7522     3512      3719

Table 5. Reduction achieved and run time for two delay models on the second set of examples.
(♦ The analysis in this case did not complete.)

                          Unit delay model                   Log delay model
         Total no.    no. of outputs  CPU time for R    no. of outputs  CPU time for R
Name     of outputs   reduced by R    (on IBM 4381)     reduced by R    (on IBM 4381)
C432           7             0             0:33                0             0:12
C499          32             0             0:36                0             0:35
C880          26             0             0:18                0             0:20
C1355         32             0             0:46                0             0:54
C1908         25             8             8:29                6             9:40
C2670        140             0             3:14                0             5:32
C3540         22            15           100:16                6             4:31
C5315        123             6            55:18                0             3:15
C6288         32             0             5:32                ♦                ♦
C7522        108            13            35:58                0             5:32



Table 6. Reduction in arrival time for the outputs for examples in the second set for two delay
models. (Note: Only examples for which any reduction occurred are included.)

No. of outputs with arrival time reduced by given percentage. Columns, left to right:
0, 0 to 2, 2 to 4, 4 to 6, 6 to 8, 8 to 10, 10 to 12, 16 to 18, 30 to 32, 34 to 36,
38 to 40, 42 to 44, 44 to 46, 48 to 50. Only the nonempty cells of each row survive,
listed left to right; the first entry of each row belongs to column 0.

Example   Delay model   Nonempty entries
C1908     unit          17 2 2 1 1 1 1
C1908     log           19 1 5
C3540     unit          7 1 5 3 2 2 1 1
C3540     log           16 5 1
C5315     unit          117 4 11
C5315     log           123
C7522     unit          95 1 12



Abstract

It is known that circuit delays and timing skews in input changes influence choice of tests to 
detect delay faults. Tests for stuck-open faults in CMOS logic circuits could also be invalidated 
by circuit delays and timing skews in input changes. Tests that detect modeled faults independent 
of the delays in the circuit under test are called robust tests. An integrated approach to the design 
of combinational logic circuits in which all single stuck-open faults and path delay faults are 
detectable by robust tests was presented by the authors earlier. This paper considers design of 
CMOS combinational logic circuits in which all multiple stuck-at, stuck-open and all multiple 
path delay faults are robustly testable. 



1. Introduction 

Testing is needed to ensure reliability of VLSI chips. There are two areas 
of testing. One is to verify the input-output logic relations of a circuit, and the 
other is to ensure that path delays in manufactured circuits meet specifications. 
CMOS has emerged as the dominant technology for manufacturing digital logic 
circuits. The classical fault model consisting of circuit lines stuck-at-0 or 1 
faults was shown to be inadequate for CMOS circuits [1-3]. Transistor stuck-on 
(TSON) and transistor stuck-open (TSOP) faults were added to this model to 
capture the effects of physical faults [4]. Testing TSOP faults in a static CMOS
combinational logic circuit requires two-pattern testing [4,5]. A two-pattern test
$\langle T_1, T_2 \rangle$ consists of a sequence of two inputs, the first of which is called
the initializing input ($T_1$) and the second the test input ($T_2$).

Among the two popular delay fault models, namely the gate delay fault model 
and path delay fault model, the latter is deemed to be more general since it 
captures the cumulative effect of small delay variations in gates along a path [7] 
as well as the faults caused by a single gate. We use the path delay fault model in
this paper.

In a two-pattern testing environment, when an initializing input is changed
to a test input, transients may be produced on circuit lines, causing the test
invalidation problem [4]. In path delay testing, besides the test invalidation
problem, it may not always be possible to sensitize a signal transition to an
output along a desired path of a given circuit, thus rendering the selected path
untestable [8]. To circumvent the test invalidation problem, tests called robust
tests were introduced [9]. Robust tests for TSOP faults in CMOS circuits are
defined to be the two-pattern tests that are not invalidated by transient signals
caused by arbitrary circuit delays and/or timing skews in input changes. A robust
test for a path-delay fault is a two-pattern test that is not invalidated by
transient inputs and delays in other paths. However, for given faults in a circuit
under test, such robust tests may not exist [4,9].

To overcome these and other problems in testing, a need has always been
felt for testable designs. With earlier testable designs, it is not possible to design
reliable CMOS gates with fan-in greater than 4-8 without using extra inputs,
called control inputs [4-6,9,10]. Recently, we have presented a testable design that
can accommodate any fan-in restriction, and requires no extra input or special
gates. These designs are robustly testable for all single TSOP and all multiple
path-delay faults [11]. In this paper, it is shown that the circuit produced by the
design procedure given in [11] is testable with respect to multiple stuck-at and
stuck-open faults as well. The tests for multiple faults are given.

2. Preliminaries 

In this section, relevant earlier results are reviewed and the notation used is
developed.

A function F of n variables $x_1, x_2, \ldots, x_n$ is said to be positive unate (negative
unate) in variable $x_i$ iff a two-level logic expression for $F(x_1, x_2, \ldots, x_n)$ exists
using $x_i$ exclusively in uncomplemented (complemented) form [12]. For example,
F(a,b,c,d) = ab + cbd + ca + ad, where the complement bars on some literals are not
legible in this copy, is positive unate in b, negative unate in d and non-unate or
binate in a or c. A unate function (positive unate function)
is unate (positive unate) in all of its variables. Given that F is a unate function,
it is known that minimal sum of products (SOP) as well as minimal product of
sums (POS) expressions for F are unique. A two-level circuit of NAND (NOR)
gates realizes a function in its sum of products form (product of sums form).
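
For small functions, unateness can be checked directly on the truth table,
using the equivalent monotonicity view of the definition: F is positive unate
in $x_i$ if raising $x_i$ from 0 to 1 never lowers F, and negative unate if it
never raises F. The sketch below is illustrative only; the function name
`unateness` and the example function g = ab + a'c' are hypothetical and are
not taken from the paper.

```python
from itertools import product


def unateness(f, n):
    """Classify each of the n variables of f (arguments and result are 0/1)
    as 'positive', 'negative', 'binate' or 'independent'."""
    result = []
    for i in range(n):
        can_rise = can_fall = False
        for bits in product((0, 1), repeat=n - 1):
            lo = f(*bits[:i], 0, *bits[i:])   # cofactor with x_i = 0
            hi = f(*bits[:i], 1, *bits[i:])   # cofactor with x_i = 1
            if hi > lo:
                can_rise = True
            if hi < lo:
                can_fall = True
        if can_rise and can_fall:
            result.append("binate")
        elif can_rise:
            result.append("positive")
        elif can_fall:
            result.append("negative")
        else:
            result.append("independent")
    return result


def g(a, b, c):
    # Hypothetical example: g = a b + a' c'
    return int((a and b) or ((not a) and (not c)))


print(unateness(g, 3))
```

The two values compared in the inner loop are the cofactors of F with respect
to $x_i$, the same quantities that appear in Shannon's expansion.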

If $F(x_1, x_2, \ldots, x_n)$ is a non-unate function and $x_i$ is a non-unate variable of F,
then from Shannon's expansion theorem,

$F(x_1, x_2, \ldots, x_n) = \bar{x}_i F_{\bar{x}_i} + x_i F_{x_i}$   (1)

and,

$F(x_1, x_2, \ldots, x_n) = (x_i + F_{\bar{x}_i})(\bar{x}_i + F_{x_i})$   (2)

where $F_{\bar{x}_i}$ and $F_{x_i}$ are defined as below:

$F_{\bar{x}_i} = F(x_1, x_2, \ldots, x_{i-1}, 0, x_{i+1}, \ldots, x_n)$

$F_{x_i} = F(x_1, x_2, \ldots, x_{i-1}, 1, x_{i+1}, \ldots, x_n)$

$F_{\bar{x}_i}$ and $F_{x_i}$ are independent of $x_i$ and depend on the other (n - 1) variables.
The decomposition of Equation (1) can be implemented by the circuit shown in
Figure 1(a) and Equation (2) can be implemented by the circuit in Figure 1(b).

The authors have earlier established that if a unate function F is realized by a
two-level NAND (NOR) gate circuit then it is robustly testable for all path delay
faults [11]. Furthermore, high fan-in gates in a testable two-level realization
of a unate function can be replaced by trees of primitive gates of lower fan-in
without affecting single fault testability. Thus, testable realizations with fan-in
restrictions can be obtained for unate functions. It was also shown that if a
two-level circuit is robustly testable for all path delay faults, then it is robustly
testable for all single TSOP faults, and the following procedure was suggested
to derive testable designs for an arbitrary function.

Procedure_Testable_Design(F)

Step 1: Minimize function F for a two-level design.

Step 2: Check if the path delay faults in the two-level circuit resulting from the
above minimization are detectable. If yes, stop here.

Step 3: If the result of Step 2 is negative, then apply a heuristic for selecting the
splitting variable ($x_i$).

Step 4: Call Procedure_Testable_Design($F_{\bar{x}_i}$) and Procedure_Testable_Design($F_{x_i}$)
to realize the circuit form shown in Figure 1.



3. Multiple-fault Testability

In this section, we prove that the circuit produced by the above procedure is
robustly testable with respect to all multiple line stuck-at and TSOP faults. We
need not consider multiple path delay faults for it was shown in [11] that the
circuit produced by Procedure_Testable_Design(F) is testable for all multiple
path delay faults.

A Useful Review

Given a product $P_i = x_{i_1} x_{i_2} \ldots x_{i_r}$ of a positive unate function F, the vertex
corresponding to $P_i$, named $V_{P_i}$, is the binary n-tuple that has 1s in positions
$i_1, i_2, \ldots, i_r$ and 0s in all other positions. Similarly, given a binary n-tuple X, the
product corresponding to X, named $P_X$, is obtained by including the uncomplemented
variable in $P_X$ corresponding to each position in X with 1.

Example: Let $x_1, x_2, x_3, x_4, x_5$ be the variables. If Q is the product $x_2 x_4 x_5$ then
the corresponding vertex $V_Q$ is 01011, and if V = 10110 is a vertex then the
corresponding product is $x_1 x_3 x_4$.

A zero cofactor (0-cofactor) of $P_i$ of a positive unate function is obtained by
substituting 0 for a 1 in $V_{P_i}$ and is denoted by $V_{P_i}(k)$ when a 1 at position k
is flipped to a 0. Thus, if $P_i$ consists of r literals, it has r 0-cofactors. In the
example above, $x_2 x_4 x_5$ is a product term. So $V_{x_2 x_4 x_5}$ is 01011 and
$V_{x_2 x_4 x_5}(4)$ = 01001.
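
The three definitions above translate directly into string operations on
n-tuples. The following sketch is only an illustration; the function names are
hypothetical, variables are identified by their 1-based index, and a product
is represented by the set of indices of its (uncomplemented) variables.

```python
def vertex_of_product(indices, n):
    """Vertex V_P: 1s at the positions of the variables in P, 0s elsewhere."""
    return "".join("1" if k in indices else "0" for k in range(1, n + 1))


def product_of_vertex(vertex):
    """Product P_X: indices of the positions of X that hold a 1."""
    return [k for k, bit in enumerate(vertex, start=1) if bit == "1"]


def zero_cofactor(vertex, k):
    """V_P(k): the vertex with the 1 at position k flipped to 0."""
    if vertex[k - 1] != "1":
        raise ValueError(f"position {k} does not hold a 1")
    return vertex[:k - 1] + "0" + vertex[k:]


def zero_cofactors(vertex):
    """All 0-cofactors of a vertex, one per literal of its product."""
    return [zero_cofactor(vertex, k) for k in product_of_vertex(vertex)]


v_q = vertex_of_product({2, 4, 5}, 5)
print(v_q)                         # vertex of x2 x4 x5
print(product_of_vertex("10110"))  # indices of the product of vertex 10110
print(zero_cofactor(v_q, 4))       # 0-cofactor at position 4
print(zero_cofactors(v_q))         # one 0-cofactor per literal
```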

In the method to be proposed for the design of robustly testable circuits, an 
important fact used is that robust tests for two-level NAND-NAND gate realiza- 
tions of the minimum SOP of a positive unate function F can be derived from 
the minimal true vertices of F and their 0-cofactors. For this, it is useful to 
discuss the following properties of 0-cofactors and minimal true vertices. 

Let $F = P_1 + P_2 + \cdots + P_m$ be the minimal sum of products of the unate
function F. Then, clearly, $F(V_{P_i}) = 1$, $1 \le i \le m$. Also, if $V_{P_i}(k)$ is a 0-cofactor of $P_i$
then $F(V_{P_i}(k)) = 0$. Let F be realized by a NAND-NAND circuit N, shown in
Figure 2, where a first-level gate $G_i$ corresponds to the product $P_i$, and $G_0$ is the
output gate.

Theorems 1-4 presented in the sequel assume that two-level circuits are
constructed of NAND gates. The proofs for the two-level circuits of NOR gates
can be easily established as the dual case. For easy understanding, the proof of
Theorem 1 is given for positive unate functions only. Extension of the proof to
more general unate functions is straightforward along the lines discussed earlier [11].

Theorem 1: A two-level irredundant realization of a positive unate function F
is robustly testable with respect to all multiple stuck-at and stuck-open faults.

Proof: A first-level gate $G_i$ realizes the complement of a product $P_i$. Let us
consider tests for pFET TSOP faults first. If the pFET of $G_i$ connected to input
$x_{i_1}$ is to be tested for a TSOP fault, then we would normally apply
$\langle V_{P_i}, V_{P_i}(i_1) \rangle$ for fault detection. The output of each gate $G_j$, $1 \le j \le m$
and $j \ne i$, is supposed to be 1 for both inputs in the two-pattern sequence while
the outputs of gates $G_i$ and $G_0$ are expected to change from 0 to 1 and from 1 to 0,
respectively. If a 1-to-0 transition is observed at the output of $G_0$ the following
may be concluded.

1. nFETs of $G_0$ are fault free.

2. $V_{P_i}(i_1)$ produces all 1s at the input of $G_0$.

3. If the TSOP fault of the pFET being considered or a line stuck-at 1 fault for
the input $x_{i_1}$ of $G_i$ exists, it cannot produce a 1-to-0 transition at the output of
$G_0$ unless

   (a) one or more nFETs of $G_i$ are open, or,

   (b) there is a line stuck-at 1 fault at the output of $G_i$.

If $G_i$ has an nFET(s) TSOP fault and the output of $G_i$ is known to be 1 at
some point, then it may be assumed to be a line stuck-at 1 fault thereafter. So, if
the test performed above is invalidated, we may assume that the output of $G_i$ has
a stuck-at 1 fault. In that case the 1 produced at the output of $G_0$ when $V_{P_i}$ was
applied must be due to some other gate(s) $G_l$ producing a 0 for this input. This
could happen only if $G_l$ had a fault, more specifically if $G_l$ had an input, say
$x_n$ (or several inputs) stuck-at 1, or a pFET of $G_l$ connected to the input $x_n$ was
stuck-open and the input preceding $V_{P_i}$ initialized $G_l$ to 0. If it was a matter of
TSOP faults alone then the input $V_{P_i}(i_1)$ followed by $V_{P_i}$ would be able to detect
the faults, since the output of $G_l$ cannot return to 0 in that case.

If the test sequence $\langle V_{P_i}, V_{P_i}(i_1), V_{P_i}, V_{P_i}(i_2), \ldots, V_{P_i}, V_{P_i}(i_r) \rangle$ is applied to
the circuit, from the arguments presented above it is clear that in presence of a
fault in $G_i$, a correct output sequence can be obtained at $G_0$ iff $x_{i_1}, x_{i_2}, \ldots, x_{i_r}$ are
all inputs to $G_l$, but then the product realized by $G_l$ is redundant, contradicting
the postulate in this theorem. One could use the same argument for all gates
$G_j$, $1 \le j \le m$. Note that selecting this order ensures that we do not have to
apply any more tests than those needed for single fault detection. The tests
above also detect any fault in $G_0$, so we do not have to apply any additional
tests.



< Q.E.D > 

In the theorem presented above, we have actually constructed tests for mul- 
tiple faults. The test length being no more than that for single faults is natu- 
rally the best we could attain. The following theorem is about decomposition 
of primitive gates into tree structures. Theorems 1 and 2 together give us a tool 
to achieve multi-level testable design for unate functions satisfying all fan-in 
requirements. 
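
The alternating test sequence used in the proof of Theorem 1 can be generated
mechanically from a product. The sketch below is illustrative only: the
function names are hypothetical, and the function F = x1 x2 + x2 x3 is an
assumed example of a minimal positive unate sum of products, not one taken
from the paper.

```python
def vertex(indices, n):
    """Vertex of a product given as a set of 1-based variable indices."""
    return "".join("1" if k in indices else "0" for k in range(1, n + 1))


def eval_sop(products, v):
    """Value of a positive unate sum of products at the vertex v."""
    return int(any(all(v[k - 1] == "1" for k in p) for p in products))


def multiple_fault_sequence(prod, n):
    """Sequence V_P, V_P(i1), V_P, V_P(i2), ..., V_P, V_P(ir) for one product."""
    v_p = vertex(prod, n)
    seq = []
    for k in sorted(prod):
        seq.append(v_p)
        seq.append(v_p[:k - 1] + "0" + v_p[k:])   # 0-cofactor at position k
    return seq


F = [{1, 2}, {2, 3}]                  # assumed example: F = x1 x2 + x2 x3
seq = multiple_fault_sequence(F[0], 3)
print(seq)
print([eval_sop(F, v) for v in seq])  # fault-free value of F for each input
```

For a minimal sum of products the fault-free output alternates between 1 on
the vertex and 0 on each of its 0-cofactors, which is the behavior the proof
relies on.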

Theorem 2: If a high fan-in primitive gate is replaced by a primitive gate tree 
then the tests for single TSOP faults of the tree circuit would detect all multiple 
faults in the tree, which consist of stuck-open and stuck-at faults. 

Proof: Decomposition of a primitive gate (such as NAND or NOR) leads to
a CMOS circuit that is absolutely fan-out free. There are no fan-outs even at
the primary input level. Every primary input of this circuit has a unique path
to the output. Let $\Phi$ be the set of faults present in this circuit. $\Phi$ consists of
$\{\phi_1, \phi_2, \ldots, \phi_n\}$. Let the fault $\phi_i$ lie in the path from primary input $x_i$ to the
output. A test for this fault under the single fault assumption model essentially
involves changing input $x_i$ while holding all others constant. When this test is
applied, two situations are possible, namely, (1) the path on which $\phi_i$ lies is
sensitized to the output even in the presence of multiple faults, (2) the path is
not sensitized. In Case 1, even if another fault is present in the path on which
$\phi_i$ is situated it would be detected. In Case 2, there is nothing left to prove. The
above proof is similar to the proof given in [13] for robust multiple stuck-open fault
detection in fan-out free circuits.



< Q.E.D > 

Comments : The tests for all single TSOP faults of the high fan-in gate may not 
be adequate to test for all single faults of the gate tree circuit. 

To illustrate this consider a NAND gate with inputs a, b, c, d, e. If fan-in 
is restricted to 2, then the gate may be decomposed as shown in Figure 3. A 
complete set of tests for the high fan-in gate is: 

< 11111,01111 > 

< 11111,10111 > 

< 11111,11011 > 

< 11111,11101 > 

< 11111,11110 > 

< 11110,11111 > 

Suppose, an nFET of the gate realizing abcde is stuck-open then the test 

is < 10 ,11111 > or < 01 ,11111 >. Neither is present in the above set. 

However, one observes that if the above tests are reversed then all single (and
hence multiple) faults can be tested. Thus, the test set stays the same size, but
the tests must be applied more times.
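The test set listed above follows a regular pattern, which can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the comments rely on the standard CMOS NAND structure (parallel pFETs, series nFETs) rather than on anything specific to this paper.

```python
def nand_two_pattern_tests(n):
    """Two-pattern tests <T1, T2> for an n-input CMOS NAND gate."""
    ones = "1" * n
    tests = []
    # One test per parallel pFET: all ones initializes the output to 0,
    # then a single input is dropped so only that pFET can pull up.
    for i in range(n):
        t2 = ones[:i] + "0" + ones[i + 1:]
        tests.append((ones, t2))
    # One test for the series nFET chain: initialize the output to 1,
    # then apply all ones so the chain has to pull the output down.
    tests.append((ones[:-1] + "0", ones))
    return tests


def reverse_tests(tests):
    """Swap initialization and test vector of every two-pattern test."""
    return [(t2, t1) for (t1, t2) in tests]


if __name__ == "__main__":
    for t1, t2 in nand_two_pattern_tests(5):
        print(f"< {t1},{t2} >")
```

Calling `reverse_tests` on the result gives the reversed pairs discussed in the text, for example `("01111", "11111")`.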

Theorem 3: If a two-level circuit is robustly testable with respect to all path
delay faults, then it is robustly testable with respect to all multiple stuck-at and
stuck-open faults. (This theorem includes the results stated in Theorem 1, but
Theorem 1 was stated separately because the number of tests proposed in
Theorem 1 is smaller.)

Proof: As before (in Theorem 1), at first we apply tests for pFET stuck-open
faults in the first level of the circuit. Since the two-level network is robustly
testable with respect to all path delay faults, it is given to us that there exists a
two-pattern test $\langle T_1, T_2 \rangle$ such that $T_1$ and $T_2$ differ in only one bit position.
When applied in succession, they launch a 1-to-0 transition at input $x_{ij}$ of gate
$G_i$, holding all other inputs of $G_i$ and all inputs of $G_o$ except the output of $G_i$
at a constant hazard-free 1 in a fault-free situation. When the input sequence
$\langle T_1, T_2, T_1 \rangle$ is applied to the network, it is clear from the discussion of Theorem
1 that it tests for either

(i) line stuck-at-1 fault at the input $x_{ij}$ of $G_i$, or

(ii) TSOP fault of the pFET of $G_i$ connected to lead $x_{ij}$.

If, however, the tests fail to indicate an error even when faults may be present,
then the faults invalidating this test must include input line stuck-at-1 fault(s)
in some other gate at the first level (and not any stuck-open fault, since they are
caught by the $T_2$, $T_1$ transition). Every two-level irredundant circuit is testable for
multiple line stuck-at faults [12] by any single stuck-at fault test set. Thus, at
some stage of this three-pattern testing a stuck-at fault would be detected. If
the circuit passes all the three-pattern tests, then there are no line stuck-at faults.
Therefore, all the TSOP tests for the first-level pFETs are also valid tests, and
there are no faults in the pFETs of the first-level gates. Hence, there are no
faults in the nFETs of the first-level gates (they yield line stuck-at faults). We
have also applied all the tests needed for $G_o$. Hence, once it is verified that there
are no faults in the first-level gates, it is clear that there is no fault in $G_o$ either.

< Q.E.D > 

Though the testing philosophy is the same, the test length in Theorem 1 is shorter
than the test length in Theorem 3, owing to the fact that for unate functions, Vp
can serve as $T_1$ for all the inputs through a gate $G_i$. In general, for non-unate
functions we may not be able to find such an input that can serve as the initializing
input for all leads through the gate, thereby reducing the overlap between the
tests.

Lemma 1 [11]: Given $F$, a non-unate function, and $x_i$, a binate variable of $F$,
there exists a combination of inputs $x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ such that $F_{x_i} = 1$
and $F_{\bar{x}_i} = 0$ ($F_{x_i} = 0$ and $F_{\bar{x}_i} = 1$).

Proof: Assume that there does not exist an input such that $F_{x_i} = 1$ and $F_{\bar{x}_i} = 0$.
This implies that for every input for which $F_{x_i} = 1$, $F_{\bar{x}_i}$ also equals 1. This implies
that $F_{\bar{x}_i}$ covers $F_{x_i}$ and hence $F_{\bar{x}_i}$ can be written as $F_{\bar{x}_i} = F_{x_i} + F'$, where $F'$ is not
equal to 0 and depends only on variables $x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$. Hence, from
Equation (1):

$$F = x_i F_{x_i} + \bar{x}_i F_{\bar{x}_i} = x_i F_{x_i} + \bar{x}_i (F_{x_i} + F') = F_{x_i} + \bar{x}_i F' \tag{3}$$

Equation (3) implies that $F$ is unate in $x_i$, a contradiction. Similarly, we can
prove that there exists an input such that $F_{x_i} = 0$ and $F_{\bar{x}_i} = 1$, for non-unate
function $F$ and its binate variable $x_i$.

< Q.E.D > 

Theorem 4: When a non-unate function $F$ is decomposed with respect to a
binate variable $x_i$ in either of the forms given in (1) or (2) and implemented by
circuits shown in Figure 1, then all multiple line stuck-at faults and TSOP faults
in the resultant circuits are detectable by robust tests if these faults in the circuits
realizing $F_{x_i}$ and $F_{\bar{x}_i}$ are detectable by robust tests.

Proof: Refer to the circuit shown in Figure 1(a). It was shown in Lemma 1 that
there exists a combination of inputs for which $F_{x_i} = 1$ and $F_{\bar{x}_i} = 0$. Now, with
this combination of inputs let us change $x_i$ from 1 -> 0. The output of $F$ is also
expected to go from 1 -> 0. If this transition is observed at the output, then

(i) the nFETs of the output gate are fault-free,

(ii) both the lines feeding the output gate are at 1.

Note that $\bar{x}_i$ may only change from 0 to 1. Therefore, in the presence of a fault,
$\overline{\bar{x}_i F_{\bar{x}_i}}$ may change only from 1 to 0 and not in the reverse direction. Thus, the test
performed above cannot be invalidated. If $x_i$ is changed back to 1, a pFET of
the output gate and the nFETs of the gate realizing $\overline{x_i F_{x_i}}$ are also tested robustly.
Similarly, by choosing $F_{x_i} = 0$ and $F_{\bar{x}_i} = 1$, the lead $\bar{x}_i$, the pFET and nFET
connected to it and the other pFET of the output gate are tested. Testing the rest
is simple, because if $x_i$ ($\bar{x}_i$) is held at 1, and the inputs to the circuit realizing
$F_{x_i}$ ($F_{\bar{x}_i}$) are changed, the output $F_{x_i}$ ($F_{\bar{x}_i}$) is guaranteed to be sensitized to the
external output without any external interference.



< Q.E.D > 

In Step 2 of Procedure_Testable_Design(F), if the two-level realization of a
function is found to be robustly delay testable, then we do not have to modify
the circuit to achieve multiple fault testability (cf. Theorem 3). However,
we may need to replace gates by gate trees to satisfy fan-in restrictions.
Theorem 2 guarantees that all fan-in constraints can be met while preserving the
testability of a circuit derived from Procedure_Testable_Design(F). If the two-level
realization of a function is not robustly delay testable, then it is clear from Steps
3 and 4 of Procedure_Testable_Design(F) that it would be expanded with respect
to a binate variable using Shannon's expansion. Theorem 4 asserts that
this expansion ensures multiple fault testability. Therefore, it is proved that
Procedure_Testable_Design(F) gives us a multiple fault testable design.
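The input combinations promised by Lemma 1, which the proof of Theorem 4 relies on, can be found by brute-force enumeration for small functions. The sketch below is an illustration only: the function name and the representation of $F$ as a Python callable are assumptions, not part of the procedure in [11].

```python
from itertools import product


def cofactor_witnesses(f, n, i):
    """Enumerate assignments of the other n-1 inputs of f.

    Returns (pos, neg): pos holds assignments with F_xi = 1 and
    F_xi' = 0, neg holds assignments with F_xi = 0 and F_xi' = 1.
    Variable i is binate exactly when both lists are non-empty.
    """
    pos, neg = [], []
    for rest in product((0, 1), repeat=n - 1):
        hi = f(*rest[:i], 1, *rest[i:])  # cofactor with x_i = 1
        lo = f(*rest[:i], 0, *rest[i:])  # cofactor with x_i = 0
        if hi == 1 and lo == 0:
            pos.append(rest)
        elif hi == 0 and lo == 1:
            neg.append(rest)
    return pos, neg


def xor3(a, b, c):
    """Three-input parity, a function that is binate in every variable."""
    return a ^ b ^ c


if __name__ == "__main__":
    print(cofactor_witnesses(xor3, 3, 0))
```

For a unate variable one of the two lists is empty, which is the situation the contradiction in the proof of Lemma 1 rules out for binate variables.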

4. Conclusion 

Until recently, it was not known whether it is possible to design CMOS com- 
binational logic circuits that are testable with respect to all single transistor 
stuck-open faults. Earlier, the authors presented a design for arbitrary combi- 
national logic functions that require no special gates and technology constraints 
such as fan-in and fan-out are easily accommodated [11]. In this paper, it is 
shown that the design presented earlier actually results in circuits in which all 
multiple path delay faults and all multiple line stuck-at and transistor stuck-open 
faults are detected. The tests to detect such faults are also described.

Figure 1(a). $F = x_i F_{x_i} + \bar{x}_i F_{\bar{x}_i}$

Figure 1(b). $F = (x_i + F_{\bar{x}_i}) (\bar{x}_i + F_{x_i})$

Figure 8. Tree realization of abcde

Abstract

In this paper, a new method for circuit optimization in the face of manufacturing process variations
is presented. It is based on the characterization of the feasible design space by worst-case points
and related gradients. The expense for this characterization is linear in the number of circuit
performances. In contrast, the complexity of other geometric approaches to tolerance
oriented circuit design increases exponentially with the dimension of the design space. A deterministic
optimization procedure based on the so-called "worst-case distances" will be introduced,
combining nominal and tolerance design in a single design objective. The entire optimization
process with regard to performance, yield, and robustness uses sensitivity analyses and requires a
much smaller number of simulations than the Monte Carlo based approaches. Moreover, the new
method takes account of the partitioning of the parameter space into deterministic and statistical
parameters, which is an inherent property of integrated circuit design.



1. Introduction 

The practical use of the large variety of existing methods for tolerance ori- 
ented circuit optimization [7, 6, 5, 15, 12, 11] within integrated circuit design 
has been obstructed by two major problems: the incomplete statistical data of 
manufacturing processes and the prohibitive simulation costs of optimization 
methods. While the increasing reliability of statistical measurements will solve 
the first problem in the near future, the high simulation costs remain the crucial 
limitation of methods for IC optimization despite increasing computer power. 
The statistical methods for yield optimization are often based on Monte Carlo 
analysis [3, 10, 16], which usually involves thousands of circuit simulations. 
Recent efforts aim at reducing the high computational costs for Monte Carlo 
circuit simulations by approximating the performance functions. Furthermore, 
the Monte Carlo based methods cannot readily be applied to IC design, as the design
space includes deterministic parameters. On the other hand, deterministic
methods for design centering are often based on geometric concepts [7, 14, 9].
Beyond the yield maximization they aim at a general improvement of the circuit's
robustness. However, their simulation costs increase exponentially with
the number of parameters.

2. Basic Relationships

Let $f \in \mathbb{R}^{n_f}$ denote the vector of circuit performances, such as delay and gain,
and let $z \in \mathbb{R}^{n_z}$ denote the vector of circuit parameters, such as channel width and
oxide thickness. In general, the performances $f(z)$ for a certain set $z$ of parameters
are not given explicitly and have to be computed by numerical circuit
simulation. $z$ is divided into three subsets: statistical fixed parameters $c$,
statistical adjustable parameters $p$, and deterministic adjustable parameters $x$:

$$z^T = [c^T\ p^T\ x^T] = [s^T\ x^T] = [c^T\ a^T] \tag{1}$$

where $s \in \mathbb{R}^{n_s}$ denotes the vector of statistical parameters and $a \in \mathbb{R}^{n_a}$ denotes
the vector of adjustable parameters.

The inevitable variations in the manufacturing process are described by a
probability density function $pdf(s)$ of the statistical parameters. A usual pdf
is the multinormal distribution $pdf_N(s, s_0, C)$, which is determined by the mean
values $s_0$ of the statistical parameters and the covariance matrix $C$. The $pdf_N$ is
constant on $n_s$-ellipsoids with center $s_0$, as shown in Fig. 1. These $n_s$-ellipsoids
are called tolerance bodies $T_s(\beta)$.

$$T_s(\beta, s_0, C) = \{ s \in \mathbb{R}^{n_s} \mid (s - s_0)^T \cdot C^{-1} \cdot (s - s_0) \le \beta^2 \} \tag{2}$$

The volume of a tolerance body $T_s$ can be determined by [1]:

$$V_s(\beta, C) = \int \cdots \int_{T_s} ds = \sqrt{\det C} \cdot \beta^{n_s} \cdot V_0 \tag{3}$$

$V_0$ is the volume of an $n_s$-sphere with radius 1. While the circuit parameters
have the defined statistics, the circuit performances are supposed to keep within
certain bounds called performance specifications. The specifications are given
as upper and/or lower bounds that define the acceptance region $A_f$ in the
performance space, as shown in Fig. 1.

$$A_f = \{ f(z) \in \mathbb{R}^{n_f} \mid f^L \le f(z) \le f^U \} \tag{4}$$

Because of the mentioned nature of the performance function $f(z)$, the acceptance
region $A_s$ in the statistical parameter space and the tolerance bodies $T_f$ in
the performance space (dotted curves in Fig. 1) are unknown. The acceptance
of a parameter set can only be checked by simulation:

$$f(z) \in A_f \iff s \in A_s(x_0) \iff z \in A_z \tag{5}$$

We now define the parametric yield $Y$ as the probability that the performances
of a manufactured circuit are within the specified bounds.

$$Y = \mathrm{prob}\{ f(z) \in A_f \} = \int \cdots \int_{A_s} pdf(s)\, ds, \qquad 0 \le Y \le 1 \tag{6}$$

Figure 1. Tolerance body in the statistical parameter space with $n_s = 2$ (cut of total parameter
space at value $x_0$ of deterministic parameters $x$) and acceptance region $A_f$ in the performance
space with $n_f = 2$.



Note, that the acceptance region in the subspace of statistical parameters 
is a cut of the total parameter space dependent on the nominal values of the 
deterministic parameters [8, 9]. 
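For reference, the yield definition (6) can be estimated directly by sampling, which is the Monte Carlo baseline that the introduction describes as expensive because every sample costs one circuit simulation. The sketch below is a toy illustration under stated assumptions: the statistical parameters are taken as uncorrelated normal variables, and `perf` stands in for a circuit simulator; neither the function name nor the toy performance comes from the paper.

```python
import random


def monte_carlo_yield(perf, lower, upper, mean, sigma,
                      n_samples=20000, seed=0):
    """Estimate Y = prob{f(z) in A_f} by random sampling.

    perf  : callable mapping a parameter list s to a list of performances
    lower : lower specification bound for each performance
    upper : upper specification bound for each performance
    mean, sigma : per-parameter mean and standard deviation
                  (assumption: uncorrelated normal parameters)
    """
    rng = random.Random(seed)
    passed = 0
    for _ in range(n_samples):
        s = [rng.gauss(m, sd) for m, sd in zip(mean, sigma)]
        f = perf(s)  # one "simulation" per sample
        if all(lo <= fi <= hi for lo, fi, hi in zip(lower, f, upper)):
            passed += 1
    return passed / n_samples


if __name__ == "__main__":
    # Toy performance f = s1 + s2 with a single upper bound.
    y = monte_carlo_yield(lambda s: [s[0] + s[1]],
                          [float("-inf")], [2 ** 0.5],
                          [0.0, 0.0], [1.0, 1.0])
    print(f"estimated yield: {y:.3f}")
```

In this toy case the performance is normal with variance 2, so the bound sits one standard deviation above the mean and the exact yield is the standard normal CDF at 1.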

3. The circuit optimization task in the 
performance space 

Consider the set of all performance bounds $f_i^B$, $i = 1, \ldots, n_B$, where $f_i^B$ can be
a lower or upper bound, $B \in \{L, U\}$. Let $f_i(s, x)$ denote the performance function
corresponding to $f_i^B$, and let $f(s_0, x_0) \in A_f$. Then, we define the performance
tolerance $\alpha_i$ by

$$\alpha_i = | f_i^B - f_i(s_0, x_0) | \tag{7}$$

In order to formulate the circuit optimization task for an arbitrary initial point,
we have to distinguish between satisfied and violated performance bounds at the
nominal point. In the case of an upper bound $f_i^U$ we define

$$\varphi_i = +1 \iff f_i^U \ge f_i(s_0, x_0), \qquad \varphi_i = -1 \iff f_i^U < f_i(s_0, x_0) \tag{8}$$

In the case of a lower bound $f_i^L$ we define

$$\varphi_i = +1 \iff f_i^L \le f_i(s_0, x_0), \qquad \varphi_i = -1 \iff f_i^L > f_i(s_0, x_0) \tag{9}$$

Then, the following assertion can be made:

$$\forall\, i \le n_B:\ \varphi_i(s_0, x_0) = +1 \iff f(z_0) \in A_f \iff s_0 \in A_s(x_0) \tag{10}$$

In order to prepare the new approach to circuit optimization, we first repeat the
problem of circuit optimization in the performance space.

$$\max_{p_0, x_0} \alpha(s_0, x_0), \qquad \alpha = [\, \ldots\ \varphi_i(s_0, x_0) \cdot \alpha_i(s_0, x_0)\ \ldots \,]^T \tag{11}$$

Figure 2. Worst-case point $s_i$, worst-case distance $\beta_i$, maximal tolerance body $T_{s,i}$ inside
acceptance region $A_{s,i}$, gradient $g_{s,i}$ and approximation $\bar{A}_{s,i}$ to $A_{s,i}$ determined by $g_{s,i}$ and $s_i$ for
one specification $f_i^U$ (dashed lines: level contours of $f_i$).



Due to the tradeoff between the vector components of a, a unique solution to 
(11) usually does not exist. Multiple Criterion Optimization methods determine 
efficient solution points, i.e., points, from which it is impossible to improve one 
objective without making another one worse. 

4. Worst-case distances and related gradients 

In order to consider the variations of the statistical parameters $s$ as well as
the performance sensitivities with respect to $s$, the performance tolerances $\alpha_i$
are transformed into the subspace of statistical parameters, $\alpha_i \to \beta_i$. Due to the
ambiguity in the solution of $f_i^B = f_i(s, x_0)$ in $s$, a decision has to be made. We
choose the point $s_i$ with the smallest distance $\beta_i$ to $s_0$, measured in a well-defined
norm $\|M^{-1} \cdot (s_i - s_0)\|$ using the weighting matrix $M$, as illustrated in Fig. 2.

$$\beta_i = \min_s \|M^{-1} \cdot (s - s_0)\| \quad \text{subject to} \quad f_i^B = f_i(s, x_0) \tag{12}$$

A general definition of the worst-case point $s_i$ was first introduced by Mueller-L.
in [13]. Mueller-L. applies the worst-case points to the solution
of an analysis task. He calculates the worst-case points approximately, calling
them "limit parameters", and uses them for the characterization of the worst-case
problem. In this paper, the idea of limit parameters is adopted. We calculate
the limit parameters exactly by solving the standard optimization problem (12)
and call them "worst-case points". Furthermore, the worst-case distance
$\beta_i = \|M^{-1} \cdot (s_i - s_0)\|$ and the performance gradients at these points will
be applied to a deterministic optimization procedure. The concept of worst-case
distances is particularly suited for the characterization and solution of the circuit
optimization problem and leads to a unified view of nominal and tolerance
design.

To the norm $\|M^{-1} \cdot (s - s_0)\|$ the norm body $N_s$ can be associated.

$$N_s(\beta, s_0, M) = \{ s \in \mathbb{R}^{n_s} \mid \|M^{-1} \cdot (s - s_0)\| \le \beta \} \tag{13}$$

In order to represent the parameter statistics in problem formulation (12), we
define the norm to be used in (12) such that the tolerance bodies of the pdf
equal the norm bodies of (13). For a normal pdf, the equality of (2) and (13) is
achieved by using a weighted $l_2$-norm according to the Cholesky decomposition
of the covariance matrix $C = M \cdot M^T$:

$$\beta = \|M^{-1} \cdot (s - s_0)\|_2 = \sqrt{(s - s_0)^T \cdot C^{-1} \cdot (s - s_0)} \tag{14}$$
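For a performance that is linear in $s$, problem (12) with the weighted $l_2$-norm has a closed-form solution, which makes a convenient sanity check. The sketch below is a minimal illustration, assuming a linear model $f_i(s) = f_0 + g^T (s - s_0)$; the function names are hypothetical, and the Cholesky factorization is written out in plain Python so that the block is self-contained.

```python
import math


def cholesky(C):
    """Lower-triangular M with C = M * M^T (C symmetric positive definite)."""
    n = len(C)
    M = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            acc = C[r][c] - sum(M[r][k] * M[c][k] for k in range(c))
            M[r][c] = math.sqrt(acc) if r == c else acc / M[c][c]
    return M


def weighted_norm(C, s, s0):
    """||M^-1 (s - s0)||_2, computed by forward substitution on M y = s - s0."""
    M = cholesky(C)
    d = [a - b for a, b in zip(s, s0)]
    y = []
    for r in range(len(d)):
        y.append((d[r] - sum(M[r][k] * y[k] for k in range(r))) / M[r][r])
    return math.sqrt(sum(v * v for v in y))


def worst_case_linear(C, g, s0, f0, f_bound):
    """Worst-case point and distance for the linear model (an assumption)."""
    n = len(g)
    Cg = [sum(C[r][c] * g[c] for c in range(n)) for r in range(n)]
    gCg = sum(g[r] * Cg[r] for r in range(n))  # equals ||M^T g||_2 squared
    step = (f_bound - f0) / gCg
    s_wc = [s0[r] + step * Cg[r] for r in range(n)]
    beta = abs(f_bound - f0) / math.sqrt(gCg)
    return s_wc, beta


if __name__ == "__main__":
    C = [[4.0, 1.0], [1.0, 2.0]]
    s_wc, beta = worst_case_linear(C, [1.0, 2.0], [0.0, 0.0], 0.0, 3.0)
    print(s_wc, beta, weighted_norm(C, s_wc, [0.0, 0.0]))
```

The distance returned by the closed form agrees with the weighted norm of the worst-case point, which is the content of (14) for this toy case. For a nonlinear performance the worst-case point has to be found iteratively.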

Fig. 2 shows the level contours (dashed lines) of a performance $f_i$ corresponding
to one upper performance bound $f_i^U$ in the statistical parameter space, the
tolerance bodies $T_s$ (ellipses) according to (2), (14), and the result of (12).
The bold dashed line represents the parameter sets for which the constraint
$f_i = f_i^U$ in (12) is satisfied. It is the boundary of the acceptance region
$A_{s,i}(x_0) = \{ s \in \mathbb{R}^{n_s} \mid f_i(s, x_0) \le f_i^U \}$ determined by the single specification $f_i^U$.
Generally, we define

$$A_{s,i}(x_0) = \{ s \in \mathbb{R}^{n_s} \mid \varphi_i(s, x_0) = +1 \} \tag{15}$$

The solution of (12) for a single bound $f_i^B$ yields the worst-case point $s_i$, the
worst-case distance $\beta_i$, the corresponding maximal tolerance body $T_{s,i}$ inside
$A_{s,i}(x_0)$ (shaded ellipse), the volume $V_{s,i}(x_0)$ of $T_{s,i}$, and the yield $Y_i$ according to
the one-dimensional marginal distribution

$$Y_i = \mathrm{prob}\{ s \in A_{s,i}(x_0) \} = \int_{-\infty}^{\varphi_i \beta_i} \frac{1}{\sqrt{2\pi}} \cdot \exp(-t^2/2)\, dt \tag{16}$$

With the results $s_i$, $\beta_i$, $T_{s,i}$, and $Y_i$ of (12), a nominal design can be rated with
respect to performance, worst-case, yield, and robustness.
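The per-specification yield is cheap to evaluate once a worst-case distance is known. The sketch below assumes that the marginal yield is the standard normal distribution function evaluated at the signed worst-case distance $\varphi_i \beta_i$; this reading is an assumption, but it reproduces the percentages listed for the inverter example in Table 1. The function name is hypothetical.

```python
import math


def yield_from_distance(phi_beta):
    """Standard normal CDF at the signed worst-case distance phi_i * beta_i."""
    return 0.5 * (1.0 + math.erf(phi_beta / math.sqrt(2.0)))


if __name__ == "__main__":
    # Signed worst-case distances taken from Table 1 of the paper.
    for pb in (5.36, 1.08, 2.41, -4.45):
        print(f"{pb:6.2f} -> {100.0 * yield_from_distance(pb):.0f}%")
```

A violated bound has a negative signed distance and therefore a yield below 50%, while distances above roughly 3 correspond to yields that round to 100%.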

Considering (2) and the properties of a pdf, it follows that (12) is equivalent
to the following problem formulations

$$\max_s\, pdf(s) \quad \text{subject to} \quad f_i^B = f_i(s, x_0) \tag{17}$$

$$\max_\beta\, V_s(\beta) \quad \text{subject to} \quad T_s(\beta, s_0) \subseteq A_{s,i}(x_0) \tag{18}$$

While (18) aims at maximizing the volume of a tolerance body lying completely
inside $A_{s,i}$, (17) defines the worst-case point $s_i$ as the point with highest probability
among the statistical parameter sets violating $f_i^B$.

Note that the presented derivations are valid for various pdfs and norms, as
illustrated by Fig. 3. Obviously, the classic worst-case analysis [14] is included
by using the $l_\infty$-norm in (12).

Figure 3. Points $s_i$ on a hyperplane with smallest distance $\beta_i = \|M^{-1} \cdot (s_i - s_0)\|$ to point $s_0$
and associated norm bodies $N_s$ according to different norms.

In the following, we will state the gradient of the performance distance $\beta_i$
with respect to the nominal parameter values $(s_0, x_0)$ for a normal pdf [2], which
is essential for the solution of the circuit optimization task formulated in the next
section.

The necessary conditions for the solution of (12) with the norm of (14) can
be derived using the Lagrangian function of (12)

$$L(s, x_0) = (s - s_0)^T \cdot C^{-1} \cdot (s - s_0) - 2\lambda \cdot (f_i(s, x_0) - f_i^B) \to \min \tag{19}$$

For the stationary point $s_i$ of (19), the following conditions are necessary

$$f_i(s_i, x_0) = f_i^B \tag{20}$$

$$M^{-1} \cdot (s_i - s_0) = \lambda_i \cdot M^T \cdot g_{s,i} \tag{21}$$

using the abbreviations for the performance gradients

$$g_{z,i} = \left. \partial f_i(z) / \partial z \right|_{z^T = [s_i^T\ x_0^T]}, \qquad g_{z,i}^T = [g_{c,i}^T\ g_{p,i}^T\ g_{x,i}^T] = [g_{s,i}^T\ g_{x,i}^T] = [g_{c,i}^T\ g_{a,i}^T] \tag{22}$$

In the course of the solution of (19), considering (20), (21), and (22), the gradient
of the performance distance $\beta_i$ with respect to the nominal values $a_0$ of the
adjustable parameters can be determined as

$$\frac{\partial \beta_i}{\partial a_0} = \frac{-\mathrm{sign}(\lambda_i)}{\|M^T \cdot g_{s,i}\|_2} \cdot g_{a,i} \tag{23}$$

$\partial \beta_i / \partial a_0$ is a function of the performance gradient and thereby is an inherent
quality of the deterministic solution of (12).
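The step from the Lagrangian to the gradient of the worst-case distance can be spelled out as follows. This is a sketch under two assumptions that are not stated in this form in the text: the Lagrangian is the squared weighted distance minus $2\lambda$ times the constraint, and the performance is linearized around the worst-case point, so that a shift of the nominal adjustable parameters changes $f_i(s_0, x_0)$ by $g_{a,i}$ per unit.

```latex
% Stationarity of the Lagrangian in s, using C = M M^T:
\frac{\partial L}{\partial s}
  = 2\,C^{-1}(s_i - s_0) - 2\lambda_i\, g_{s,i} = 0
  \;\Longrightarrow\;
  M^{-1}(s_i - s_0) = \lambda_i\, M^{T} g_{s,i}

% Taking the l2-norm of both sides gives the worst-case distance:
\beta_i = \lVert M^{-1}(s_i - s_0) \rVert_2
        = |\lambda_i| \cdot \lVert M^{T} g_{s,i} \rVert_2

% Linearized constraint: f_i^B - f_i(s_0, x_0) = g_{s,i}^T (s_i - s_0)
%                                             = \lambda_i\, g_{s,i}^T C\, g_{s,i}
\lambda_i = \frac{f_i^B - f_i(s_0, x_0)}{\lVert M^{T} g_{s,i} \rVert_2^{2}}
  \;\Longrightarrow\;
  \beta_i = \frac{|f_i^B - f_i(s_0, x_0)|}{\lVert M^{T} g_{s,i} \rVert_2}

% Differentiating with respect to the nominal adjustable parameters a_0,
% with \partial f_i(s_0, x_0) / \partial a_0 = g_{a,i}:
\frac{\partial \beta_i}{\partial a_0}
  = \frac{-\operatorname{sign}(\lambda_i)}{\lVert M^{T} g_{s,i} \rVert_2}\; g_{a,i}
```

The sign factor appears because $\lambda_i$ has the sign of $f_i^B - f_i(s_0, x_0)$: moving the nominal performance toward a satisfied bound shrinks the distance, and moving it further past a violated bound enlarges it.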

Note that the partitioning of the parameter space is naturally included in (23).
Note also that for the classic worst-case optimization the norm in the denominator
of (23) is the $l_1$-norm (dual norm to the $l_\infty$-norm).

Figure 4. Acceptance region $A_s$ spanned by three performance bounds $f_1^B$, $f_2^B$, $f_3^B$, leading to
three worst-case points $s_1$, $s_2$, $s_3$ (left: initial design, right: optimized design).



5. Circuit optimization based on worst-case 
distances 

Fig. 4 illustrates the statistical parameter space for three performance bounds.
Obviously, the acceptance region $A_s$ can be characterized by the worst-case
points $s_i$ and the performance gradients $g_{s,i}$ with an expense linear in the number
of specs.

$$A_s = \bigcap_{i \le n_B} A_{s,i} \approx \bigcap_{i \le n_B} \bar{A}_{s,i} = \bar{A}_s \tag{24}$$

We can see that no consideration of the convexity of $A_s$ is necessary. $\bar{A}_s$ can be
used to estimate the yield without performing circuit simulations according to

$$Y \approx \mathrm{prob}\{ s \in \bar{A}_s(x_0) \} \tag{25}$$

Based on the worst-case distances and their gradients introduced in the previous 
section, we now formulate the problem of circuit optimization in the parameter 
space. 

$$\max_{p_0, x_0} \beta(s_0, x_0), \qquad \beta = [\, \ldots\ \varphi_i(s_0, x_0) \cdot \beta_i(s_0, x_0)\ \ldots \,]^T \tag{26}$$

(26) requires the maximization of all tolerance volumes with respect to all
performance bounds $f_i^B$. As in (11), the tradeoff between the vector components of
$\beta$ has to be considered.

Thereby, (26) optimizes performance, yield, worst-case, and robustness of a 
nominal design. 

In order to consider all these aspects, we finally formulate the problem of 
circuit optimization with a scalar objective function to be: 

$$\min_{p_0, x_0} \gamma^T \cdot \gamma, \qquad \gamma = [\, \ldots\ \exp(-\varphi_i(s_0, x_0) \cdot \beta_i(s_0, x_0))\ \ldots \,]^T \tag{27}$$

From (27) we can see: when the performance bound is satisfied, the corresponding
component of $\gamma$ obtains a value between 0 and 1; when it is violated, it obtains
a value greater than 1. In that way, all violated specs obtain an adequate penalty
weight, and we can start the optimization procedure with an arbitrary initial
value of $a_0$.

                 initial                  preliminary              optimized
                 $\varphi_i\beta_i$   $Y_i$     $\varphi_i\beta_i$   $Y_i$     $\varphi_i\beta_i$   $Y_i$
$t_{dhl}^L$       5.36       100%          4.02       100%          3.17       100%
$t_{dhl}^U$       1.08        86%          2.41        99%          3.15       100%
$t_{dlh}^L$       6.32       100%          3.04       100%          2.79       100%
$t_{dlh}^U$      -4.45         0%          2.33        99%          2.83       100%
$Y_{total}$                    0%                      99%                     100%
$t_{dhl}$         3.5                      3.0                      2.8
$t_{dlh}$         7.9                      3.0                      2.9
$W_p, W_n$        4.0μ, 4.0μ               11.6μ, 4.9μ              12.4μ, 5.4μ

Table 1. CMOS inverter optimization progress.

(27) corresponds to a least squares optimization problem, for which efficient 
solution techniques have been developed. 
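To see how the scalar objective behaves, it can be evaluated on the signed worst-case distances reported for the inverter example. The sketch below assumes that each component of the objective vector is the exponential of the negated signed worst-case distance, as described in the text; the function name and the dictionary layout are hypothetical, and the numbers are the $\varphi_i \beta_i$ values of the three designs in Table 1.

```python
import math


def scalar_objective(phi_beta):
    """Sum of squared components exp(-phi_i * beta_i)."""
    return sum(math.exp(-v) ** 2 for v in phi_beta)


# Signed worst-case distances of the four bounds, from Table 1.
TABLE1 = {
    "initial": [5.36, 1.08, 6.32, -4.45],
    "preliminary": [4.02, 2.41, 3.04, 2.33],
    "optimized": [3.17, 3.15, 2.79, 2.83],
}

if __name__ == "__main__":
    for name, distances in TABLE1.items():
        print(f"{name:12s} {scalar_objective(distances):.5g}")
```

The violated bound dominates the objective of the initial design, and the value keeps decreasing from the preliminary to the optimized design even though both already reach a yield of about 100%, which reflects the equalization of the worst-case distances.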

The interactive trust-region method presented in [4], which is based on qua- 
dratic programming, is especially well-suited for circuit optimization according 
to (26) and (27). It is based on gradient information of the objective functions ac- 
cording to (23) and is especially well-suited for ill-conditioned problems. Fig. 4 
shows an initial and optimized design according to (27) with equalized worst- 
case distances and tolerance bodies, respectively. 

If the parameter distribution is not known, we set: C = I (identity matrix). 
Then, (27) fulfills the pure nominal optimization with respect to performance 
sensitivities. 

6. Examples 

The first example is an integrated CMOS inverter from a standard cell library.
We will inspect the delay times $t_{dhl}$ and $t_{dlh}$ for the falling and rising
output slope with the specification bounds $t_d^L = 2$ and $t_d^U = 4$.

For the underlying manufacturing process, the variations and correlations of the
transistor model parameters are determined as in [13]. In total, 11 device parameters
are considered, which constitute the set of statistical fixed parameters. The
channel widths $W_p$ and $W_n$ of the p- and n-channel transistor constitute the set of
deterministic adjustable parameters. Typically, in IC design there are no statistical
adjustable parameters. Table 1 shows the nominal performances, worst-case
distances, and estimated yields provided by the new method for the initial, a
preliminary, and the optimized design according to (27). The worst-case distances
of the initial design indicate that $t_{dlh}^U$ is violated ($\varphi = -1$) and that the nominal point
is closer to $t_{dhl}^L$ than to $t_{dlh}^L$. The latter cannot be learned from the yield values,
which are both 100%. The performance values of a preliminary design achieved
during the design procedure are right between the lower and upper bounds, and the
corresponding yield value has reached approximately 100%. Thus, the preliminary
design is optimal with respect to the yield optimization problem. However,
the $\beta_i$ show that the optimum with respect to the circuit's robustness has not
been reached yet. For the optimized design, the $\beta_i$ have been equalized. We
can see that $t_{dlh}$ is somewhat more sensitive than $t_{dhl}$ (smaller value of $\beta_i$) and
that the nominal performance point is not in the middle of $A_f$. Fig. 5 shows
the performance space for the optimized design including the worst-case performance
values. A comparison with a genuine Monte Carlo analysis shows that
the estimated yield values are very precise (1-2% difference).

Figure 5. Performance space of CMOS inverter with two delay times, optimized design with
nominal point $f_0$ and worst-case performances $f_i$ associated to worst-case points $s_i$.

The second example is an SC-filter, described in [3], which can be investi- 
gated by others and which is known to be very ill-conditioned. It is characterized 
by 12 statistical adjustable parameters (no fixed or deterministic parameters), 44 
performances, and 73 spec bounds. In [17], this example had been optimized 
by several methods, including sophisticated Monte Carlo based methods, with 
a best optimization result of 60% yield (initial yield was 24%). We have ap- 
plied our new method to this circuit and achieved an optimized design with the 
parameter values given by [... pi ...] = [0.1972 0.0874 0.3141 0.3052 0.0508 
0.1771 0.1976 0.0990 0.5518 0.3593 0.1269 0.2124] and a yield of 64% at a 
total expense of 80 sensitivity analyses. Thus, the presented method obtained a 
better result than statistical methods at lower simulation costs. 

7. Conclusion 

In this paper, a new approach to circuit optimization has been introduced, 
based on worst-case distances related to the manufacturing process variations. 

For each performance specification, there exists exactly one characteristic tol- 
erance body touching the performance specification in the worst-case point. The 
number of worst-case points equals the number of performance specifications. 
The acceptance region in the parameter space can be characterized by the worst- 
case points with a linear expense due to the number of specs. In contrast to this,
other geometric approaches to tolerance design have a complexity exponentially
increasing with the number of parameters.

The new design target “worst-case distance” measures the distance of the 
nominal point to the worst-case point in terms of the underlying manufacturing 
process. For each performance, there exists exactly one worst-case distance, 
which characterizes the robustness margin and the yield due to one performance 
spec. 

A unified approach to the nominal and tolerance design problem is made pos- 
sible by the worst-case points, the worst-case distances, and the related gradi- 
ents. The new method for circuit design enables an extensive problem diagnosis 
and includes the partitioning of the parameter space in deterministic and statis- 
tical, adjustable and fixed parameters. 

The new approach includes both yield optimization and design centering and 
can be established by a deterministic optimization procedure. Compared to the 
Monte Carlo analysis, equivalent optimization results are obtained at signifi- 
cantly lower simulation costs. 

In the case that no statistical parameter distribution is given, the new method 
fulfills the pure nominal design task including the consideration of performance 
sensitivities.

Abstract

Many recent papers have formulated both timing verification and optimization as mathematical 
programming problems. Such formulations correctly handle level-sensitive latches, long and 
short path considerations, and sophisticated multi-phase clocking schemes. 

This paper deals with the computational aspects of using such a formulation for verifying clock 
schedules. We show that the formulation can have multiple solutions, and that these extraneous 
solutions can cause previously published algorithms to produce incorrect or misleading results. 
We characterize the conditions under which multiple solutions exist, and show that even when 
the solution is unique, the running times of these previous algorithms can be unbounded. By 
contrast, we exhibit a simple polynomial time algorithm for clock schedule verification. The 
algorithm was implemented and used to check the timing of all the circuits in the ISCAS-89 
suite. Observed running times are linear in circuit size and quite practical. 



1. Introduction 

An elegant mathematical programming formulation for timing verification 
and clock schedule optimization was presented in [1]. The formulation, hence- 
forth referred to as the SMO formulation, handles both long and short path 
constraints, deals correctly with both edge-triggered and level-sensitive latches, 
handles circuits with complex, multi-phase clocking schemes, and is sufficiently 
flexible to handle various system constraints that might be externally imposed 
upon the circuit being analyzed. It also provides a rigorous framework upon 
which one can discuss convexity of solution spaces, robustness of solutions, etc. 
For these reasons, several recent papers have built upon this model to treat more 
advanced timing problems including retiming [2] and wave-pipelining [3]. 

In the SMO formulation, the timing properties of a circuit are modeled using
the parameters and variables appearing in Figure 1. In this paper, we shall deal
only with level-sensitive latches, the extension to edge-triggered flip-flops being
a straightforward exercise. We identify the latches of the circuit with the integers
from 1 to $n$. The symbol $\to$ denotes the "fans out to" relation on latches, that
is, $i \to j$ if and only if there is a path of strictly combinational logic elements
extending from the output of latch $i$ to the input of latch $j$. Without loss of
generality, we assume that each latch has at least one fanin, that is, for every $j$,
there exists an $i$ such that $i \to j$. The term path henceforth means a sequence of
$k > 0$ latches, $j_0, \ldots, j_k$, such that $j_l \to j_{l+1}$ for $0 \le l < k$.

Parameters

  $n$            number of latches in circuit
  $p_i$          clock phase controlling latch $i$
  $S_i$          setup time of latch $i$
  $H_i$          hold time of latch $i$
  $\delta_{ij}$  minimum combinational delay from latch $i$ to latch $j$
  $\Delta_{ij}$  maximum combinational delay from latch $i$ to latch $j$

Variables defining the clock schedule

  $\pi$          clock period
  $w_i$          length of time that clock phase $i$ is active
  $e_i$          absolute time within period when phase $i$ begins

Other variables

  $E_{ij}$       time between start of phase $i$ and next phase $j$
  $a_i$          earliest signal arrival time at latch $i$
  $A_i$          latest signal arrival time at latch $i$
  $d_i$          earliest signal departure time from latch $i$
  $D_i$          latest signal departure time from latch $i$

Equations and constraints

  $E_{i,j} = e_j - e_i$ if $i < j$, and $E_{i,j} = \pi + e_j - e_i$ otherwise
  $d_i = \max(a_i, \pi - w_{p_i})$
  $D_i = \max(A_i, \pi - w_{p_i})$
  $a_i = \min_{j \to i}(d_j + \delta_{j,i} - E_{p_j,p_i})$
  $A_i = \max_{j \to i}(D_j + \Delta_{j,i} - E_{p_j,p_i})$
  $a_i \ge H_i$
  $A_i \le \pi - S_i$

Figure 1. The SMO timing model for level-sensitive latches. It may easily be extended to
accommodate edge-triggered flip-flops.

The clock schedule optimization problem asks us to find the minimum value
of $\pi$ for which there is an assignment to all variables consistent with the
constraints in Figure 1. The clock schedule verification problem, on the other hand,
presents us with values for $\pi$, $w$ and $e$ and asks us to find values for the rest of the
variables so as to satisfy the constraints. Since this paper deals exclusively with
the schedule verification problem, we can simplify the SMO constraints to better
suit our purpose. We introduce the auxiliary variables $B$, $\lambda$, and $\Lambda$ as shown
in Figure 2. It is easy to see that $B$, $\lambda$, and $\Lambda$ are uniquely determined by the
clock schedule, as is $E$. We may therefore recast the clock schedule verification
problem as asking whether there exist $a$, $d$, $A$, and $D$ that obey the equations and
satisfy the constraints appearing in Figure 2.

Variables determined directly by clock schedule

  $E_{i,j} = e_j - e_i$ if $i < j$, and $E_{i,j} = \pi + e_j - e_i$ otherwise
  $B_i = \pi - w_{p_i}$
  $\lambda_{j,i} = \delta_{j,i} - E_{p_j,p_i}$
  $\Lambda_{j,i} = \Delta_{j,i} - E_{p_j,p_i}$

Equations defining arrival and departure times

  $d_i = \max(a_i, B_i)$
  $D_i = \max(A_i, B_i)$
  $a_i = \min_{j \to i}(d_j + \lambda_{j,i})$
  $A_i = \max_{j \to i}(D_j + \Lambda_{j,i})$

Constraints for correct operation

  $a_i \ge H_i$
  $A_i \le \pi - S_i$

Figure 2. The SMO timing model as simplified for clock schedule verification.
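The quantities that are fixed by the clock schedule can be computed once, before any equation solving starts. The sketch below is an illustration under stated assumptions: the function name, the dictionary-based data layout, and the two-phase example are hypothetical, and the formulas are taken as $B_i = \pi - w_{p_i}$, $\lambda_{j,i} = \delta_{j,i} - E_{p_j,p_i}$ and $\Lambda_{j,i} = \Delta_{j,i} - E_{p_j,p_i}$.

```python
def derive_schedule_constants(period, e, w, phase, min_delay, max_delay):
    """Compute B, lambda and Lambda from a clock schedule.

    period    : clock period pi
    e, w      : dicts phase -> start time, phase -> active width
    phase     : dict latch -> controlling phase p_i
    min_delay : dict (j, i) -> minimum delay from latch j to latch i
    max_delay : dict (j, i) -> maximum delay from latch j to latch i
    """
    def phase_shift(i, j):
        # Time from the start of phase i to the next start of phase j.
        return e[j] - e[i] if i < j else period + e[j] - e[i]

    B = {i: period - w[phase[i]] for i in phase}
    lam = {(j, i): d - phase_shift(phase[j], phase[i])
           for (j, i), d in min_delay.items()}
    Lam = {(j, i): d - phase_shift(phase[j], phase[i])
           for (j, i), d in max_delay.items()}
    return B, lam, Lam


if __name__ == "__main__":
    # Two latches on a two-phase clock, each feeding the other.
    B, lam, Lam = derive_schedule_constants(
        10, {1: 0, 2: 5}, {1: 4, 2: 4}, {1: 1, 2: 2},
        {(1, 2): 2, (2, 1): 2}, {(1, 2): 6, (2, 1): 6})
    print(B, lam, Lam)
```

Note that two latches on the same phase get a phase shift of one full period, since the "next" occurrence of the phase is one cycle later.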

The following lemma will be invoked many times throughout this paper. 

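Lemma 1.1 Let $(a,d,A,D)$ be a solution to the equations. Let $j_0, \ldots, j_k$ be a 
path with $k > 0$. Then $D_{j_k} \ge A_{j_k} \ge D_{j_0} + \sum_{i=0}^{k-1} \Lambda_{j_i,j_{i+1}}$. Moreover, if $j_0 = j_k$, then 
$0 \ge \sum_{i=0}^{k-1} \Lambda_{j_i,j_{i+1}}$. 

Proof. From the equations, $D_i \ge A_i \ge D_j + \Lambda_{j,i}$ for any $i$ and $j$ for which 
$j \to i$. In particular, $D_{j_k} \ge A_{j_k} \ge D_{j_{k-1}} + \Lambda_{j_{k-1},j_k}$. Continuing the substitution, 
$D_{j_k} \ge A_{j_k} \ge D_{j_0} + \sum_{i=0}^{k-1} \Lambda_{j_i,j_{i+1}}$. Setting $D_{j_0} = D_{j_k}$ and simple manipulation 
complete the result. □ 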

The following observation, although elementary, is used so frequently 
throughout this paper that it is worth stating explicitly. 

Fact 1.2 The functions min and max are both monotonic in their arguments, 
that is, increasing the value of any argument cannot decrease the value of the 
function, nor can decreasing the value of any argument cause an increase in the 
value of the function. 
Finally, a solution $(a^*, d^*, A^*, D^*)$ is called a minimum solution, if for any 
other solution $(a,d,A,D)$, we have $a_i^* \le a_i$, $d_i^* \le d_i$, $A_i^* \le A_i$, and $D_i^* \le D_i$, for 
all $i$. 

2. Solving the Equations 

Iteration is a popular method for solving a set of equations $V = F(V)$ over 
a set of variables $V = \{V_j \mid 1 \le j \le n\}$. One begins with a trial “solution” $V^0$ 
and computes a succession of iterates, $V^{i+1} = F(V^i)$, until convergence occurs, 
$V^{i+1} = V^i$, at which point $V^i$ is a solution to $F$. When the equations are formed 
from monotonic operators, as in our case, the entire sequence of iterates will 
be monotonic nondecreasing or nonincreasing, provided we choose the initial 
iterate properly. Thus, when $V^0 \le V^1$ componentwise (that is, $V_j^0 \le V_j^1$ for all $j$) 
we are guaranteed that $V^0 \le V^1 \le V^2 \le \cdots$. Similarly, when $V^0 \ge V^1$, we will have 
$V^0 \ge V^1 \ge V^2 \ge \cdots$. The monotonicity of the sequence of iterates can be used to 
prove convergence provided a suitable upper (or lower) bound is known for the 
solution. 

For clock schedule verification, we can partition the equations into two inde- 
pendent subsets, one subset (the earliest time equations) involving the variables 
ai and di, and the other subset (the latest time equations) involving the variables 
Ai and Di. By choosing appropriate starting points for either subset, we can 
choose whether we converge upwards or downwards to a solution. The choice 
of starting point is crucial; it affects both the correctness of the result and the 
time needed for the process to converge. We shall show that the best approach 
is to converge upwards in both sets of equations to what will turn out to be the 
minimum fixed point of the verification equations. The rest of this section de- 
fines this technique more precisely, proves that it yields a minimum solution if 
any solution exists at all, and establishes that the method runs in polynomial 
time. 

Construction 2.1 For a given clock schedule, compute $E$, $B$, $\lambda$, $\Lambda$ as described 
above. Then, for all $m \ge 0$, define 

$a_i^m = -\infty$ if $m = 0$, and $a_i^m = \min_{j \to i}(d_j^{m-1} + \lambda_{j,i})$ if $m > 0$ 
$A_i^m = -\infty$ if $m = 0$, and $A_i^m = \max_{j \to i}(D_j^{m-1} + \Lambda_{j,i})$ if $m > 0$ 
$d_i^m = \max(a_i^m, B_i)$ 
$D_i^m = \max(A_i^m, B_i)$. 

It is easy to see that this process is well defined. Simply compute $A^0$ and $a^0$, 
then $D^0$ and $d^0$, then $A^1$ and $a^1$, etc. The values of the variables turn out to be 
monotonic nondecreasing in $m$, as shown in the following lemma. 
Lemma 2.2 (Monotonicity) For all $i$ and $m \ge 1$, $a_i^m \ge a_i^{m-1}$, $d_i^m \ge d_i^{m-1}$, $A_i^m \ge A_i^{m-1}$, 
and $D_i^m \ge D_i^{m-1}$. 

Proof. By induction on $m$. We will only provide the argument for $a_i^m$ and $d_i^m$, 
the argument for $A_i^m$ and $D_i^m$ being similar. For the basis with $m = 1$, we have 
$a_i^1 \ge a_i^0$ because $a_i^0 = -\infty$. Moreover, $d_i^0 = B_i$ and $d_i^1 \ge B_i$, so $d_i^1 \ge d_i^0$. 

For the inductive step, assume the lemma is true for $m - 1$ with $m > 1$. By the 
monotonicity of min and the inductive assertion, we have $\min_{j \to i}(d_j^{m-1} + \lambda_{j,i}) \ge 
\min_{j \to i}(d_j^{m-2} + \lambda_{j,i})$. By definition, $a_i^m = \min_{j \to i}(d_j^{m-1} + \lambda_{j,i})$ and $a_i^{m-1} = 
\min_{j \to i}(d_j^{m-2} + \lambda_{j,i})$, so $a_i^m \ge a_i^{m-1}$. The monotonicity of max then implies that 
$\max(a_i^m, B_i) \ge \max(a_i^{m-1}, B_i)$, and so $d_i^m \ge d_i^{m-1}$. □ 

The next lemma shows that the iterated variables approach a solution from 
below. As a consequence, if the process converges, it will converge to a solution 
which is a minimum solution. 

Lemma 2.3 Let $(a,d,A,D)$ be a solution to the equations. Then for all $i$ and 
$m \ge 0$, $a_i^m \le a_i$, $d_i^m \le d_i$, $A_i^m \le A_i$, and $D_i^m \le D_i$. 

Proof. By induction on $m$ using an argument nearly identical to Lemma 2.2. 
We will only provide the argument for $a_i^m$ and $d_i^m$, the argument for $A_i^m$ and $D_i^m$ 
being similar. For the basis with $m = 0$, we have $a_i^0 \le a_i$ because $a_i^0 = -\infty$. 
Moreover, $d_i^0 = B_i$, and $d_i \ge B_i$ directly from the equations, and so $d_i^0 \le d_i$. 

For the inductive step, assume the lemma is true for $m - 1$. By the 
monotonicity of min and the inductive assertion, we have $\min_{j \to i}(d_j^{m-1} + 
\lambda_{j,i}) \le \min_{j \to i}(d_j + \lambda_{j,i})$. By definition, $a_i^m = \min_{j \to i}(d_j^{m-1} + \lambda_{j,i})$ and $a_i = 
\min_{j \to i}(d_j + \lambda_{j,i})$, so $a_i^m \le a_i$. The monotonicity of max then implies that 
$\max(a_i^m, B_i) \le \max(a_i, B_i)$, and so $d_i^m \le d_i$. □ 

As the iteration proceeds, the values of the variables increase. The next 
lemma will be used to relate each such change to some latch which is “to blame.” 

Lemma 2.4 For any $i$, if $m \ge 1$ and $d_i^m > d_i^{m-1}$, then there exists a $j$ such that 
$j \to i$, and $d_i^m = a_i^m \le d_j^{m-1} + \lambda_{j,i}$. Moreover, if $m > 1$, then $d_j^{m-1} > d_j^{m-2}$. 

Proof. Since $d_i^m > d_i^{m-1}$ by hypothesis, 

$d_i^m = \max(a_i^m, B_i) > \max(a_i^{m-1}, B_i) = d_i^{m-1}$. 

Lemma 2.2 then implies that $d_i^m = a_i^m$ and $a_i^m > a_i^{m-1}$. 

Case 1: $m = 1$. By definition, $a_i^1 = \min_{p \to i}(d_p^0 + \lambda_{p,i})$. Pick $j$ such that $j \to i$ 
and $d_j^0 + \lambda_{j,i} = \min_{p \to i}(d_p^0 + \lambda_{p,i})$. Then $a_i^1 \le d_j^0 + \lambda_{j,i}$, as required. 

Case 2: $m > 1$. Since $a_i^m > a_i^{m-1}$, we have $\min_{p \to i}(d_p^{m-1} + \lambda_{p,i}) > 
\min_{p \to i}(d_p^{m-2} + \lambda_{p,i})$. Clearly, there must exist at least one $p$ for which $p \to i$ 
and $d_p^{m-1} > d_p^{m-2}$. Take $j = p$. Then $a_i^m = \min_{p \to i}(d_p^{m-1} + \lambda_{p,i}) \le d_j^{m-1} + \lambda_{j,i}$, 
satisfying the lemma. □ 

The next lemma is the key technical result of this section. It shows that the 
iterations need not be continued indefinitely. More specifically, it shows that if 
the earliest time equations do not converge within n iterations, then the latest 
time equations do not converge at all. 

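Lemma 2.5 If $d_i^n \ne d_i^{n-1}$ for some $i$, then the equations have no solution. 

Proof. By Lemma 2.2, $d_i^n > d_i^{n-1}$. Construct a path $j_0, \ldots, j_n$ by setting 
$j_n = i$ and then applying Lemma 2.4 $n$ times to obtain $j_{n-1}, \ldots, j_0$. Then, for 
each $k$, $0 \le k < n$, we have $j_k \to j_{k+1}$ and $d_{j_{k+1}}^{k+1} \le d_{j_k}^{k} + \lambda_{j_k,j_{k+1}}$. Moreover, 
for $1 \le k \le n$, we have $d_{j_k}^{k} > d_{j_k}^{k-1}$. Accordingly, for any $p$ and $q$ with $p < q$, 
we have $d_{j_q}^{q} \le d_{j_p}^{p} + \sum_{k=p}^{q-1} \lambda_{j_k,j_{k+1}}$. We also have $d_{j_q}^{q} > d_{j_q}^{q-1}$, which implies by 
Lemma 2.2 that $d_{j_q}^{q} > d_{j_q}^{p}$. Since each $j_i$ is an integer between 1 and $n$ inclusive, 
we can pick $p$ and $q$ so that $j_p = j_q$. Thus $d_{j_p}^{p} < d_{j_p}^{p} + \sum_{k=p}^{q-1} \lambda_{j_k,j_{k+1}}$. Hence, 
$0 < \sum_{k=p}^{q-1} \lambda_{j_k,j_{k+1}}$, implying that $0 < \sum_{k=p}^{q-1} \Lambda_{j_k,j_{k+1}}$. However, Lemma 1.1 applied 
to the subsequence $j_p, \ldots, j_q$ says that if a solution exists, we must have $0 \ge 
\sum_{k=p}^{q-1} \Lambda_{j_k,j_{k+1}}$. Therefore the equations are unsolvable. □ 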

The next two lemmas do for the variables A and D what the previous two 
lemmas did for a and d. Note that the statements of the lemmas are very close, 
but not identical, to the earlier lemmas. 

Lemma 2.6 For any $i$, if $m \ge 1$ and $D_i^m > D_i^{m-1}$, then there exists a $j$ such that 
$j \to i$, and $D_i^m = A_i^m = D_j^{m-1} + \Lambda_{j,i}$. Moreover, if $m > 1$, then $D_j^{m-1} > D_j^{m-2}$. 

Proof. Similar to Lemma 2.4. □ 

Lemma 2.7 If $D_i^n \ne D_i^{n-1}$ for some $i$, then the equations have no solution. 

Proof. Similar to Lemma 2.5. □ 

Next we show that Construction 2.1 converges to a solution if it converges at 
all. 

Lemma 2.8 Let $p$ and $q$ be integers such that $d_i^{p-1} = d_i^{p}$ for every $i$, and $D_i^{q-1} = 
D_i^{q}$ for every $i$. Then $(a^p, d^p, A^q, D^q)$ is a minimum solution to the equations. 

Proof. It is trivial to verify that $(a^p, d^p, A^q, D^q)$ satisfies the equations. 
Lemma 2.3 shows that the solution is a minimum solution. □ 

Finally we arrive at the key result of this section, namely, a polynomial time 
algorithm for solving the equations. 
Figure 3. A circuit and clock schedule for which the SMO equations have multiple solutions. 

Theorem 2.9 If the equations have a solution, then they have a minimum solu- 
tion. Moreover, this minimum solution can be found in $O(n^3)$ time. 

Proof. Clearly we can compute the vectors $d^n$ and $D^n$ in $O(n^3)$ time. In $O(n)$ 
additional time, we can apply Lemmas 2.5 and 2.7 to test whether any solution 
exists. If one does, then $(a^n, d^n, A^n, D^n)$ is a solution. Moreover, by Lemma 2.8, 
it is a minimum solution. □ 
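
A minimal Python sketch of Construction 2.1, combined with the tests of Lemmas 2.5 and 2.7, might look as follows. The function name solve_min and the dictionary-based circuit representation (fanin[i] lists the latches j that fan out to i) are assumptions of this sketch. It re-evaluates every equation in each of n rounds, so its cost is proportional to n times the number of edges.

```python
import math


def solve_min(fanin, B, lam, Lam):
    """Iterate Construction 2.1 for m = 0..n, where n is the number of latches.

    fanin[i] lists every j with j -> i (assumed non-empty for each latch).
    Returns (a, d, A, D), or None when d^n != d^(n-1) or D^n != D^(n-1).
    """
    latches = list(B)
    a = {i: -math.inf for i in latches}          # a^0
    A = {i: -math.inf for i in latches}          # A^0
    d = {i: max(a[i], B[i]) for i in latches}    # d^0 = B
    D = {i: max(A[i], B[i]) for i in latches}    # D^0 = B
    changed = False
    for _ in range(len(latches)):
        # a and A of round m are computed from d and D of round m - 1.
        a = {i: min(d[j] + lam[(j, i)] for j in fanin[i]) for i in latches}
        A = {i: max(D[j] + Lam[(j, i)] for j in fanin[i]) for i in latches}
        d_next = {i: max(a[i], B[i]) for i in latches}
        D_next = {i: max(A[i], B[i]) for i in latches}
        changed = d_next != d or D_next != D     # round m versus round m - 1
        d, D = d_next, D_next
    return None if changed else (a, d, A, D)
```

For a single latch with a zero-weight self-loop and B = 2, the sketch returns the solution in which every variable equals 2. For a two-latch loop whose delays sum to a positive value, it returns None.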

The reader will note that our suggested solution to the latest time equations 
is equivalent to the Bellman-Ford algorithm for finding shortest paths in a di- 
rected graph with positive and negative edge weights. Ishii [4] has previously 
advocated this approach as well. 

3. Uniqueness 

The algorithm of the previous section can be used to find a solution to the 
equations. Clock schedule verification can then be performed by checking 
whether the solution obeys the constraints on setup and hold times listed in Fig- 
ure 2. This of course raises an important question, namely, what if the equations 
have multiple solutions? Some of the solutions might satisfy the constraints, 
while other solutions violate them. For example, consider the circuit and clock 
schedule shown in Figure 3. The corresponding SMO equations and constraints 
become 

$\lambda_{1,1} = 0$    $\Lambda_{1,1} = 0$ 

$a_1 = d_1$    $A_1 = D_1$ 

$d_1 = \max(a_1, 2)$    $D_1 = \max(A_1, 2)$ 

$a_1 \ge 3$    $A_1 \le 8$. 

For any $x \ge 2$, $A_1 = D_1 = a_1 = d_1 = x$ is a valid solution to these equations. 
Solutions having $x < 3$ exhibit a hold violation, solutions having $x > 8$ exhibit 
a setup violation, and solutions with $3 \le x \le 8$ are free of violations. 
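
The case analysis for this one-latch example can be written out as a small Python sketch. The function name classify and its default parameters are assumptions of the sketch; the constants 2, 3, and 8 are the ones appearing in the equations and constraints above.

```python
def classify(x, B=2, hold=3, latest=8):
    """Classify the candidate a_1 = d_1 = A_1 = D_1 = x for the one-latch example."""
    if x != max(x, B):          # d_1 = max(a_1, 2) fails when x < 2
        return "not a solution"
    if x < hold:                # constraint a_1 >= 3
        return "hold violation"
    if x > latest:              # constraint A_1 <= 8
        return "setup violation"
    return "no violation"
```

The minimum solution x = 2 falls in the hold-violation range, while any x between 3 and 8 passes both checks.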

Thus we see that solutions to the SMO equations are not, in general, unique, 
and different solutions can give differing results vis-a-vis the timing constraints. 
In this section, we will characterize the circumstances under which multiple 
solutions exist. More specifically, we will show that multiple solutions can only 
exist at the optimal clock period, and are due to cycles of zero aggregate delay 
in the circuit. 

Throughout this section, we will use $(a,d,A,D)$ to denote an arbitrary solu- 
tion to the equations, and $(a^*,d^*,A^*,D^*)$ to denote the minimum solution to the 
equations. We begin with a pair of lemmas needed in the proof of the theorem 
which is to follow. 

Lemma 3.1 If $d_i > d_i^*$, then there exists a $j$ such that $d_j > d_j^*$, $j \to i$ and $d_i \le 
d_j + \lambda_{j,i}$. 

Proof. By hypothesis, $d_i = \max(a_i, B_i) > \max(a_i^*, B_i) = d_i^*$, implying $a_i > a_i^*$ 
and $d_i = a_i$. Thus, $a_i = \min_{p \to i}(d_p + \lambda_{p,i}) > \min_{p \to i}(d_p^* + \lambda_{p,i}) = a_i^*$. Hence we can find a $j$ 
such that $j \to i$ and $d_j > d_j^*$. Moreover, $a_i \le d_j + \lambda_{j,i}$. □ 

Lemma 3.2 If $D_i > D_i^*$, then there exists a $j$ such that $D_j > D_j^*$, $j \to i$ and 
$D_i = D_j + \Lambda_{j,i}$. 

Proof. Similar to Lemma 3.1. By hypothesis, $D_i = \max(A_i, B_i) > 
\max(A_i^*, B_i) = D_i^*$, implying $A_i > A_i^*$. Thus, $A_i = \max_{p \to i}(D_p + \Lambda_{p,i}) > 
\max_{p \to i}(D_p^* + \Lambda_{p,i}) = A_i^*$. Take $j$ such that $j \to i$ and $D_j + \Lambda_{j,i} = \max_{p \to i}(D_p + 
\Lambda_{p,i})$. Certainly $D_j > D_j^*$. Moreover, $A_i = D_j + \Lambda_{j,i}$. □ 

We are now ready to show that multiple solutions to the equations are due to 
the presence of zero weight cycles in the circuit. Such a cycle represents a path 
through the circuit in which a signal returns to some latch at precisely the same 
time (relative to the clock period) as it left that latch. 

Theorem 3.3 If the equations have more than one solution, then there exists a 
path $j_0, \ldots, j_k$, $k \ge 1$, such that $j_0 = j_k$ and $0 = \sum_{i=0}^{k-1} \Lambda_{j_i,j_{i+1}}$. 

Proof. Let $(a,d,A,D)$ be a solution other than the minimum solution 
$(a^*,d^*,A^*,D^*)$. 

Case 1: $d_m > d_m^*$ for some $m$. Construct a path $j_0, \ldots, j_n$ by setting $j_n = m$ 
and then applying Lemma 3.1 $n$ times to obtain $j_{n-1}, \ldots, j_0$. Then, for each $i$, 
$0 \le i < n$, we have $j_i \to j_{i+1}$ and $d_{j_{i+1}} \le d_{j_i} + \lambda_{j_i,j_{i+1}}$. Clearly, this path must 
contain repeated indices, so pick $p$ and $q$ such that $p < q$ and $j_p = j_q$. The path 
that will satisfy the lemma is $j_p, \ldots, j_q$. Substituting, $d_{j_q} \le d_{j_p} + \sum_{i=p}^{q-1} \lambda_{j_i,j_{i+1}}$. 
Thus $0 \le \sum_{i=p}^{q-1} \lambda_{j_i,j_{i+1}} \le \sum_{i=p}^{q-1} \Lambda_{j_i,j_{i+1}}$. By Lemma 1.1, we conclude that $0 = 
\sum_{i=p}^{q-1} \Lambda_{j_i,j_{i+1}}$. 

Case 2: $D_m > D_m^*$ for some $m$. Use Lemma 3.2 to construct a path $j_0, \ldots, j_n$ 
with $j_n = m$, such that $j_i \to j_{i+1}$ and $D_{j_{i+1}} = D_{j_i} + \Lambda_{j_i,j_{i+1}}$ for each $i$, $0 \le i < n$. 
Pick $p$ and $q$ such that $p < q$ and $j_p = j_q$. Now consider the subpath $j_p, \ldots, j_q$. 
Clearly, $D_{j_q} = D_{j_p} + \sum_{i=p}^{q-1} \Lambda_{j_i,j_{i+1}}$, and we have $0 = \sum_{i=p}^{q-1} \Lambda_{j_i,j_{i+1}}$. 

Case 3: $d_m = d_m^*$ and $D_m = D_m^*$ for every $m$. This implies that $a_m = a_m^*$ and 
$A_m = A_m^*$ for every $m$, in contradiction to the assumption that the two solutions 
are distinct. □ 

An immediate consequence is that non-unique solutions are a phenomenon 
that can only occur at the optimal clock period. Said another way, the equations 
have a unique solution whenever the clock schedule has a suboptimal period. 

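Corollary 3.4 If the equations have more than one solution for a clock schedule 
with period $\pi$, then $\pi$ is optimal, that is, no valid schedule has a period less than 
$\pi$. 

Proof. Let $j_0, \ldots, j_k$ fulfill the conditions of Theorem 3.3. Then 

$0 = \sum_{i=0}^{k-1} \Lambda_{j_i,j_{i+1}} = \sum_{i=0}^{k-1} \Delta_{j_i,j_{i+1}} - \sum_{i=0}^{k-1} E_{p_{j_i},p_{j_{i+1}}}$, 

which may be further rewritten as 

$0 = \sum_{i=0}^{k-1} \Delta_{j_i,j_{i+1}} - \sum_{i=0}^{k-1} (e_{p_{j_{i+1}}} - e_{p_{j_i}}) - c\pi$, 

where $c$ is the number of $i$, $0 \le i < k$, for which $p_{j_{i+1}} \le p_{j_i}$. Observing that 

$\sum_{i=0}^{k-1} (e_{p_{j_{i+1}}} - e_{p_{j_i}}) = \sum_{i=1}^{k} e_{p_{j_i}} - \sum_{i=0}^{k-1} e_{p_{j_i}} = e_{p_{j_k}} - e_{p_{j_0}} = 0$, 

we can conclude that 

$0 = \sum_{i=0}^{k-1} \Lambda_{j_i,j_{i+1}} = \sum_{i=0}^{k-1} \Delta_{j_i,j_{i+1}} - c\pi$. 

Reducing $\pi$ would certainly violate Lemma 1.1, and so no solutions exist for 
smaller values of $\pi$. □ 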

Having shown that zero-weight cycles are a necessary condition for multiple 
solutions to the SMO equations, we shall next show that they are also sufficient. 
We'll do this by showing how such a cycle in a solution can be used to construct 
other solutions as well. The basic idea is to add some $\varepsilon$ to all the $A_i$'s and $D_i$'s 
along the cycle. In general, this alone will not yield a solution. However, the 
“perturbed” values can be used to reinitialize the algorithm of §2, which can then 
be used to converge upwards to a new solution. Before proceeding, we note a 
consequence of the previous theorem. 
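Corollary 3.5 Let $(a,d,A,D)$ be a solution to the equations and let $j_0, \ldots, j_k$ be 
as in Theorem 3.3. Then $D_{j_{i+1}} = A_{j_{i+1}} = D_{j_i} + \Lambda_{j_i,j_{i+1}}$ for $0 \le i < k$. 

Proof. First, suppose that some $D_{j_i} \ne A_{j_i}$. Renumbering the $j$'s if necessary, 
assume that it is $D_{j_k} \ne A_{j_k}$. By Lemma 1.1, $D_{j_k} \ge A_{j_k} \ge D_{j_0} + \sum_{i=0}^{k-1} \Lambda_{j_i,j_{i+1}} = 
D_{j_0} = D_{j_k}$. This clearly implies that $D_{j_k} = A_{j_k}$. 

Second, suppose that some $D_{j_{i+1}} = A_{j_{i+1}} \ne D_{j_i} + \Lambda_{j_i,j_{i+1}}$. Once again, renum- 
bering if necessary, assume that $D_{j_1} = A_{j_1} \ne D_{j_0} + \Lambda_{j_0,j_1}$. By Lemma 1.1, 
$D_{j_k} \ge D_{j_1} + \sum_{i=1}^{k-1} \Lambda_{j_i,j_{i+1}}$. Thus $D_{j_k} > D_{j_0} + \sum_{i=0}^{k-1} \Lambda_{j_i,j_{i+1}} = D_{j_0}$. This is im- 
possible since $j_k = j_0$. □ 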

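Construction 3.6 Suppose that $(a,d,A,D)$ is a solution to the equations and 
that $j_0, \ldots, j_k$, $k \ge 1$, is a path for which $j_0 = j_k$ and $0 = \sum_{i=0}^{k-1} \Lambda_{j_i,j_{i+1}}$. Let $C$ be 
the set $\{j_i \mid 0 \le i \le k\}$ and let $\varepsilon$ be any positive real. For all $m \ge 0$, define 

$\hat{A}_i^0 = A_i + \varepsilon$ if $i \in C$, and $\hat{A}_i^0 = A_i$ otherwise 
$\hat{A}_i^m = \max_{j \to i}(\hat{D}_j^{m-1} + \Lambda_{j,i})$ if $m \ge 1$ 
$\hat{D}_i^m = \max(\hat{A}_i^m, B_i)$ 

Lemma 3.7 $\hat{D}_i^0 = \hat{A}_i^0 = D_i + \varepsilon = A_i + \varepsilon$ if $i \in C$, and $\hat{D}_i^0 = D_i$ if $i \notin C$. 

Proof. Suppose that $i \in C$. Corollary 3.5 tells us $D_i = \max(A_i, B_i) = A_i \ge B_i$. 
Hence, $\hat{D}_i^0 = \max(\hat{A}_i^0, B_i) = \max(A_i + \varepsilon, B_i) = A_i + \varepsilon$. Since $D_i = A_i$, we have 
$\hat{D}_i^0 = D_i + \varepsilon$. 

Next suppose that $i \notin C$. Then $\hat{D}_i^0 = \max(\hat{A}_i^0, B_i) = \max(A_i, B_i) = D_i$. □ 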

The next several lemmas show that the iterations performed on the perturbed 
solution yield monotonically non-decreasing values, must converge to another 
solution within $n$ steps, and do not affect the perturbed $A_i$ and $D_i$ values for those latches 
$i$ that are on the cycle $C$. 

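Lemma 3.8 For all $i$ and $m \ge 1$, $\hat{A}_i^m \ge \hat{A}_i^{m-1}$, and $\hat{D}_i^m \ge \hat{D}_i^{m-1}$. 

Proof. By induction on $m$. For the basis, $m = 1$. First consider $\hat{A}_i^1$. There are 
two cases, depending on whether $i$ lies on the cycle. 

Case 1: $i \notin C$. Lemma 3.7 tells us that $\hat{D}_j^0 \ge D_j$ for all $j$. Then $\hat{A}_i^1 = 
\max_{j \to i}(\hat{D}_j^0 + \Lambda_{j,i}) \ge \max_{j \to i}(D_j + \Lambda_{j,i}) = A_i = \hat{A}_i^0$. 

Case 2: $i \in C$. By Corollary 3.5, there exists some $p \in C$ such that 

$A_i = \max_{j \to i}(D_j + \Lambda_{j,i}) = D_p + \Lambda_{p,i}$. 

We also have 

$\hat{A}_i^1 = \max_{j \to i}(\hat{D}_j^0 + \Lambda_{j,i}) \ge \hat{D}_p^0 + \Lambda_{p,i} = D_p + \varepsilon + \Lambda_{p,i}$. 

Together, these imply that 

$\hat{A}_i^1 \ge A_i + \varepsilon = \hat{A}_i^0$, 

as was to be shown. 

Having established that $\hat{A}_i^1 \ge \hat{A}_i^0$, it is easy to see that $\hat{D}_i^1 = \max(\hat{A}_i^1, B_i) \ge 
\max(\hat{A}_i^0, B_i) = \hat{D}_i^0$ by the monotonicity of max. Thus $\hat{D}_i^1 \ge \hat{D}_i^0$. 

For the inductive step, assume the lemma is true for $m - 1$ with $m > 1$. By 
the monotonicity of max and the inductive assertion, we have $\max_{j \to i}(\hat{D}_j^{m-1} + 
\Lambda_{j,i}) \ge \max_{j \to i}(\hat{D}_j^{m-2} + \Lambda_{j,i})$. Recall that $\hat{A}_i^m = \max_{j \to i}(\hat{D}_j^{m-1} + \Lambda_{j,i})$ and 
$\hat{A}_i^{m-1} = \max_{j \to i}(\hat{D}_j^{m-2} + \Lambda_{j,i})$, so substitution gives $\hat{A}_i^m \ge \hat{A}_i^{m-1}$. The monotonic- 
ity of max then implies that $\max(\hat{A}_i^m, B_i) \ge \max(\hat{A}_i^{m-1}, B_i)$, and so $\hat{D}_i^m \ge \hat{D}_i^{m-1}$ 
as was to be shown. □ 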

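Lemma 3.9 For any $i$, if $m \ge 1$ and $\hat{D}_i^m > \hat{D}_i^{m-1}$, then there exists a $j$ such that 
$j \to i$, and $\hat{D}_i^m = \hat{A}_i^m = \hat{D}_j^{m-1} + \Lambda_{j,i}$. Moreover, if $m > 1$, then $\hat{D}_j^{m-1} > \hat{D}_j^{m-2}$. 

Proof. Analogous to Lemma 2.6. Since $\hat{D}_i^m > \hat{D}_i^{m-1}$ by hypothesis, $\hat{D}_i^m = 
\max(\hat{A}_i^m, B_i) > \max(\hat{A}_i^{m-1}, B_i) = \hat{D}_i^{m-1}$. Lemma 3.8 then implies that $\hat{D}_i^m = \hat{A}_i^m$ 
and $\hat{A}_i^m > \hat{A}_i^{m-1}$. 

Case 1: $m = 1$. By definition, $\hat{A}_i^1 = \max_{p \to i}(\hat{D}_p^0 + \Lambda_{p,i})$. Pick any $j$ such that 
$j \to i$ and $\hat{D}_j^0 + \Lambda_{j,i} = \max_{p \to i}(\hat{D}_p^0 + \Lambda_{p,i})$. Hence $\hat{A}_i^1 = \hat{D}_j^0 + \Lambda_{j,i}$. 

Case 2: $m > 1$. Since $\hat{A}_i^m > \hat{A}_i^{m-1}$, we have $\max_{p \to i}(\hat{D}_p^{m-1} + \Lambda_{p,i}) > 
\max_{p \to i}(\hat{D}_p^{m-2} + \Lambda_{p,i})$. Pick any $j$ such that $j \to i$ and $\hat{D}_j^{m-1} + \Lambda_{j,i} = 
\max_{p \to i}(\hat{D}_p^{m-1} + \Lambda_{p,i})$. By our choice of $j$, $\hat{D}_i^m = \hat{A}_i^m = \hat{D}_j^{m-1} + \Lambda_{j,i}$. Moreover, 
$\hat{D}_j^{m-1} > \hat{D}_j^{m-2}$. □ 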

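Lemma 3.10 For any $i \in C$, if $m \ge 1$ then $\hat{D}_i^m = D_i + \varepsilon$. 

Proof. Since $D_i + \varepsilon = \hat{D}_i^0$ by Lemma 3.7, it suffices to show that $\hat{D}_i^m = \hat{D}_i^0$ 
for all $m$. Suppose the contrary, that is, suppose that $\hat{D}_i^m > \hat{D}_i^0$ for some $m \ge 1$. 
Accordingly, there is some $k \ge 1$ such that $\hat{D}_i^k > \hat{D}_i^{k-1}$. Apply Lemma 3.9 to 
construct a path $j_0, \ldots, j_k$ with $j_k = i$ such that 

$\hat{D}_i^k = \hat{D}_{j_0}^0 + \sum_{p=0}^{k-1} \Lambda_{j_p,j_{p+1}}$. 

Moreover, Lemmas 3.8 and 3.7 tell us that $\hat{D}_i^k > \hat{D}_i^0 = D_i + \varepsilon$, and so we have, 

$D_i + \varepsilon < \hat{D}_{j_0}^0 + \sum_{p=0}^{k-1} \Lambda_{j_p,j_{p+1}}$. 

Next, apply Lemma 1.1 to path $j_0, \ldots, j_k$ to see that 

$D_i \ge D_{j_0} + \sum_{p=0}^{k-1} \Lambda_{j_p,j_{p+1}}$. 

Thus, $\hat{D}_{j_0}^0 > D_{j_0} + \varepsilon$, contradicting Lemma 3.7. □ 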

Theorem 3.11 (converse of Theorem 3.3) Let $j_0, \ldots, j_k$, $k \ge 1$, be a path such 
that $j_0 = j_k$ and $0 = \sum_{i=0}^{k-1} \Lambda_{j_i,j_{i+1}}$. Then the equations either have no solutions, 
or else they have infinitely many solutions, and the solutions are unboundedly 
large. 

Proof. If the equations have any solution at all, then we can apply Construc- 
tion 3.6 to that solution. We claim that $\hat{D}_i^n = \hat{D}_i^{n-1}$ for all $i$. It is easy to show 
this using an argument along the same lines as the proof of Lemma 2.7. That 
is, assume the contrary, apply Lemma 3.9 to produce a path passing through the 
same node twice, and from there show the existence of a cycle of positive weight 
contradicting Lemma 1.1. 

It is now straightforward to verify that $(a, d, \hat{A}^n, \hat{D}^n)$ is a solution to the equa- 
tions. Moreover, each value of $\varepsilon$ used in Construction 3.6 gives rise to a different 
solution, as readily seen from Lemma 3.10. We conclude that there are infinitely 
many solutions to the equations. □ 
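
The perturbation argument can be tried out numerically. The Python sketch below follows the idea of Construction 3.6: raise A by a positive amount on a zero-weight cycle, then iterate the latest-time equations for n rounds. The function name perturb, the argument layout, and the three-latch example in the note after the code are assumptions of this sketch, not material from the paper.

```python
def perturb(fanin, B, Lam, A, cycle, eps):
    """Re-run the latest-time iteration after raising A by eps on the cycle.

    fanin[i] lists every j with j -> i; A is the latest-arrival part of a
    known solution; cycle is the set of latches on a zero-weight cycle.
    Returns (A_hat, D_hat) after n rounds, n being the number of latches.
    """
    latches = list(B)
    A_hat = {i: A[i] + eps if i in cycle else A[i] for i in latches}
    D_hat = {i: max(A_hat[i], B[i]) for i in latches}
    for _ in range(len(latches)):
        A_hat = {i: max(D_hat[j] + Lam[(j, i)] for j in fanin[i])
                 for i in latches}
        D_hat = {i: max(A_hat[i], B[i]) for i in latches}
    return A_hat, D_hat
```

As a hypothetical example, take latches 1 and 2 on a cycle with weights 3 and -3, a third latch fed by latch 2 with weight 1, and all B equal to 0. Starting from the solution A = (0, 3, 4) and eps = 2, the sketch returns (2, 5, 6): the cycle latches keep their raised values and the change propagates to the latch off the cycle.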

Corollary 3.12 If there is more than one solution to the equations, then there 
are solutions that violate the setup constraints. 

Proof. Simply pick $\varepsilon$ large enough. □ 

Theorem 3.13 The uniqueness of a solution can be determined in $O(n^2)$ time. 

Proof. Let $(a,d,A,D)$ be a solution to the equations. Define the relation 
determines (symbolized $\triangleright$) by $j \triangleright i$ if and only if $j \to i$ and $D_i = D_j + \Lambda_{j,i}$. We 
claim that the equations have multiple solutions if and only if $\triangleright$ is cyclic, that is, 
if and only if there exists some $i$ for which $i \triangleright^{+} i$, where $\triangleright^{+}$ denotes the transitive 
closure of $\triangleright$. 

For the “if” part, suppose that $\triangleright$ is cyclic. Then we can find a sequence 
$j_0, \ldots, j_k$ with $k \ge 1$, such that $j_0 = j_k$ and $j_i \triangleright j_{i+1}$ for each $i$ with $0 \le i < k$. 
By definition of $\triangleright$, this means that $D_{j_{i+1}} = D_{j_i} + \Lambda_{j_i,j_{i+1}}$ for $0 \le i < k$. Sub- 
stituting, we have $D_{j_k} = D_{j_0} + \sum_{i=0}^{k-1} \Lambda_{j_i,j_{i+1}}$. Since $j_0 = j_k$, it must be that 
$0 = \sum_{i=0}^{k-1} \Lambda_{j_i,j_{i+1}}$. Since we hypothesized that the equations have at least one 
solution, Theorem 3.11 tells us that multiple solutions exist. 

For the “only if” part, suppose that the equations have multiple solutions. 
By Theorem 3.3, there exists a path $j_0, \ldots, j_k$, $k \ge 1$, such that $j_0 = j_k$ and 
$0 = \sum_{i=0}^{k-1} \Lambda_{j_i,j_{i+1}}$. We show that $j_i \triangleright j_{i+1}$ for $0 \le i < k$, by as- 
suming the contrary and deriving a contradiction. Accordingly, after renumber- 
ing the $j$'s if necessary, assume that $j_{k-1} \not\triangleright j_k$. Then $D_{j_k} \ne D_{j_{k-1}} + \Lambda_{j_{k-1},j_k}$. Since 
the equations require $D_{j_k} \ge D_{j_{k-1}} + \Lambda_{j_{k-1},j_k}$, we must have $D_{j_k} > D_{j_{k-1}} + \Lambda_{j_{k-1},j_k}$. 
If we apply Lemma 1.1 to the path $j_0, \ldots, j_{k-1}$, we see that $D_{j_{k-1}} \ge D_{j_0} + 
\sum_{i=0}^{k-2} \Lambda_{j_i,j_{i+1}}$. Combining these, we get $D_{j_k} > D_{j_0} + \sum_{i=0}^{k-1} \Lambda_{j_i,j_{i+1}}$. This implies 
that $0 > \sum_{i=0}^{k-1} \Lambda_{j_i,j_{i+1}}$, in contradiction to our earlier statement that this sum was 
exactly 0. We conclude that $j_0 \triangleright^{+} j_0$ and hence that $\triangleright$ is cyclic. 

Having established the claim, we can present the desired algorithm. Given 
a solution, construct the $\triangleright$ relation in $O(n^2)$ time. Then use a topological sort 
algorithm or a strong components algorithm to determine whether $\triangleright$ is cyclic. 
□ 
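
The uniqueness test described in this proof can be sketched in Python as follows. The function name has_multiple_solutions and the dictionary representation are assumptions of the sketch. The cycle check uses Kahn's topological sort, which is one of the two kinds of algorithm the proof mentions.

```python
from collections import deque


def has_multiple_solutions(D, Lam):
    """Return True iff the 'determines' relation built from D and Lam is cyclic.

    D:   latch -> latest departure time in some solution
    Lam: (j, i) -> adjusted maximum delay for each pair with j -> i
    """
    succ = {i: [] for i in D}
    indegree = {i: 0 for i in D}
    for (j, i), weight in Lam.items():
        if D[i] == D[j] + weight:        # j determines i
            succ[j].append(i)
            indegree[i] += 1
    queue = deque(i for i in D if indegree[i] == 0)
    removed = 0
    while queue:                         # repeatedly peel off unconstrained latches
        j = queue.popleft()
        removed += 1
        for i in succ[j]:
            indegree[i] -= 1
            if indegree[i] == 0:
                queue.append(i)
    return removed < len(D)              # leftover latches lie on or behind a cycle
```

Each edge is examined a constant number of times, so the work is linear in the number of latches plus edges. With floating-point delays the exact equality test would need a tolerance, which this sketch leaves out.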



Having determined that the SMO equations might have multiple solutions, we 
are faced with a question of interpretation, namely, which is the “right” solution? 
At least two viewpoints are possible. 

One viewpoint is that the minimum fixed point of the equations is the pre- 
ferred solution because this is what the actual circuit “does” when started from 
an arbitrary stable initial state. When the circuit first begins to be clocked, sig- 
nals leave latches at the opening edge of the latches. As operation continues, 
departure times get pushed out later and later into the transparent interval of the 
latches. Eventually the circuit reaches a steady state and the arrival and depar- 
ture times stay constant. Like Construction 2.1, the actual circuit begins with 
arrival and departure times as small as possible and converges upwards to an 
equilibrium position. Indeed, we can interpret variable $A_i^m$ in that construction 
as the latest arrival time at latch i of any signal that began at the opening edge 
of some latch and subsequently passed through m or fewer latches during their 
transparent phases. 

Another viewpoint is that it is not permissible to operate a circuit under a 
clock schedule that has multiple solutions. The rationale here is that each so- 
lution represents a possible operating point of the circuit, and all the operating 
points are in equilibrium with each other. The slightest perturbation of a delay 
or of the arrival of a clock signal (possibly caused by a power supply fluctua- 
tion or other environmental effect) can cause the circuit to switch from one operating 
point to another. Since we know that some of the solutions violate setup require- 
ments, we conclude that it is too dangerous to operate the circuit this close to the 
“edge.” Said another way, there is no margin for error under such a schedule. 
Figure 4. A circuit and schedule in which upwards convergence of early arrival times takes 
arbitrarily long. For $m \le 4/\varepsilon$, the iterated early time equals $2 + m\varepsilon$, and for $m > 4/\varepsilon$, it equals 6. 



4. Traps for the Unwary 

Even in the absence of multiple solutions, iterative methods for solving the 
a and d equations can take arbitrarily many iterations regardless of the choice 
of starting point and the direction of convergence. In Figure 4 we see a circuit 
and schedule which takes $4/\varepsilon$ iterations to reach a solution under upwards con- 
vergence. Note however that the A and D equations for this same circuit and 
schedule have no solution. A close rereading of the proof of Lemma 2.5 reveals 
that the a and d equations are only guaranteed to converge in n iterations when 
the A and D equations are known to be solvable. 

Earlier approaches [1,5] initialize the A and D equations using lower bounds, 
and the a and d equations using upper bounds. The convergence for the A and 
D equations is upwards and for the a and d equations downwards. It is easy to 
show that the latter process must converge downwards because the first iterate 
is componentwise less than or equal to the zeroth iterate and the solutions are 
bounded below. Another approach [6] attempts to first solve the A and D equa- 
tions by upward convergence and then uses this solution to provide initial values 
for the a and d equations which then converge downwards to a solution. Unfor- 
tunately, both these methods can take arbitrarily long as illustrated in Figure 5. 
Moreover, neither process necessarily finds a minimum fixed point for the equa- 
tions, so we might not get the “right” answer vis-a-vis the hold constraints (it is 
easy to see that if any solution has a hold violation, then the minimum solution 
will too). 

When the SMO formulation is used for finding an optimal clock schedule, a 
mathematical optimization program is used to find a clock schedule and a set 
of arrival and departure times, {a,d,A,D). Unfortunately, as we saw in Corol- 
lary 3.4, it is likely that multiple solutions exist for an optimal schedule, and so 
one must carefully interpret whether such a schedule is in fact correct. More- 
over, if the min and max operators in the equations were relaxed to inequalities 
before being passed to the solver as advocated in [7], then the schedule returned 
by the optimizer is correct but the arrival and departure times returned by the 
Figure 5. A circuit and schedule in which downwards convergence of early arrival times takes 
arbitrarily long. The least fixed point of the latest equations has $A_2 = D_2 = 7$. Taking this as 
a starting point for iterating the earliest time equations, the iterate at latch 2 equals $7 - m\varepsilon$ for $m \le 1/\varepsilon$, and 
6 for all larger $m$. 



solver might not even satisfy the original SMO equations. It has been suggested 
that iterating the equations using the times from the solver as starting values 
will yield a solution. This is true, but probably unwise. First, it might take a 
long time to converge if the a and d values decrease. Second, you might con- 
verge to a solution other than the minimum solution. It probably makes more 
sense to simply discard the arrival and departure times found by the solver and 
use the verification algorithm of §2 to calculate a minimum solution directly 
from the optimal schedule. Depending on one’s view on the multiple solution 
phenomenon, one might then test whether the solution satisfies the setup and 
hold constraints, or else test the solution for uniqueness using Theorem 3.13. 

An amusing insight is provided by considering the operation of the algorithm 
of §2 as a simulation of the operation of the circuit during the first n clock 
cycles after power is turned on. As the early arrival and departure times increase 
monotonically to their steady-state values, they might very well violate hold 
requirements at various latches. This can happen even if the equations have a 
unique solution and no hold constraints are violated at that solution. This implies 
that any reset operations intended to initialize the circuit to a consistent state 
should persist for at least as many cycles as it takes the algorithm to converge. 

5. Implementation Experience 

We implemented the algorithm shown in Figure 6. It is easy to show that 
this algorithm gives the same answer as Construction 2.1, converges at least as 
fast, and only uses $O(n)$ storage. Not shown are a number of features that make 
the algorithm more useful in practice. First, critical short and long paths can be 
recovered after execution by storing with each $A_i$ or $a_i$ the value of $j$ which last 
caused it to be increased. Second, this same backtracking can be used to print 
out a critical long path if the algorithm diverges. Finally, the $A_i$ and $a_i$ should 
be “clipped” to $\pi - S_i$ in order to prevent the propagation of false errors which 
for each i with 1 ≤ i ≤ n do 
    A_i ← a_i ← −∞; 
for m ← 1 step 1 until m > n 
    for each i with 1 ≤ i ≤ n do 
        D_i ← max(A_i, B_i); 
        d_i ← max(a_i, B_i); 
    for each i with 1 ≤ i ≤ n do 
        A_i ← max(A_i, max_{j→i}(D_j + Λ_{j,i})); 
        a_i ← max(a_i, min_{j→i}(d_j + λ_{j,i})); 
    if no A_i or a_i changed during this pass then 
        return “algorithm converged”; 
return “algorithm diverged”; 

Figure 6. Pseudo-code for the algorithm as implemented. 



might otherwise occur when an overly long path continues on to other latches. 
Users usually only want to see a diagnostic at the first late latch in each such 
path. Various heuristics may be employed to make the algorithm run faster, for 
example, only evaluating those equations for which an argument has changed 
during the previous iteration. 
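
The bookkeeping for recovering critical paths can be illustrated with a short Python sketch. It covers only the latest-time half of the in-place iteration and records, for each latch, the fanin that last raised its arrival time. The function names, the data layout, and the choice to allow n + 1 passes (the first pass always raises every value from minus infinity) are assumptions of this sketch rather than details taken from the implementation described here.

```python
import math


def latest_times(fanin, B, Lam):
    """In-place upward iteration for A; cause[i] is the fanin that last raised A[i].

    Returns (converged, A, cause).
    """
    latches = list(B)
    A = {i: -math.inf for i in latches}
    cause = {i: None for i in latches}
    for _ in range(len(latches) + 1):
        D = {i: max(A[i], B[i]) for i in latches}
        changed = False
        for i in latches:
            j = max(fanin[i], key=lambda p: D[p] + Lam[(p, i)])
            candidate = D[j] + Lam[(j, i)]
            if candidate > A[i]:
                A[i], cause[i], changed = candidate, j, True
        if not changed:
            return True, A, cause
    return False, A, cause


def critical_long_path(i, A, B, cause):
    """Walk the recorded causes back from latch i to the latch where the path starts."""
    path = [i]
    while True:
        j = cause[path[-1]]
        if j is None or j in path:
            break        # no recorded cause, or the walk closed a cycle
        path.append(j)
        if A[j] <= B[j]:
            break        # latch j departs at its opening edge: path starts here
    path.reverse()
    return path
```

On a hypothetical three-latch ring with weights 4, 3, and -10 and all B equal to 0, the iteration converges and critical_long_path reports that the latest arrival at latch 3 originates at latch 1 and passes through latch 2.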

The program itself is only 558 lines of C, most of which are concerned with 
reading the circuit description or formatting the output. We ran the program on 
transformed versions of all the ISCAS '89 benchmarks as described in [8] using 
clock schedules that had been found to be optimal by other means. The largest 
such circuit had 3272 latches and 67704 edges in the $\to$ relation. In all cases the 
running time of the algorithm was less than 20 seconds, almost all of which was 
consumed reading the circuit description and building data structures. Moreover, 
only a few iterations were ever necessary for the algorithm to converge, implying 
(for these circuits anyway) that signals do not usually flow continuously through 
very long chains of transparent latches without having to stop and wait. 

6. Summary 

Although the solution to the SMO equations is not necessarily unique, mul- 
tiple solutions can only occur at the optimal clock period. The presence of a 
critical cycle in the circuit is both a necessary and sufficient condition for these 
multiple solutions to exist. We offered some viewpoints on the physical signif- 
icance of non-unique solutions, and pointed out some of the complications that 
they cause in both timing verification and optimization. 

Simple iterative techniques can be used to solve the equations, but one must 
use care in picking the starting point. The wrong starting point can lead to 
arbitrarily long running times and incorrect results, but the correct starting point 
is guaranteed to converge in $n$ iterations, where $n$ is the number of latches in the 
circuit. The total running time is at most $O(n^3)$ time, or $O(ne)$ time where $e$ 
now is the number of edges in the circuit graph. Given a solution, we can test in 
$O(n^2)$ time (alternatively, $O(e)$ time) whether it is a unique solution. 

The algorithm is simple to implement, and we ran it on enough test circuits 
to conclude that it takes more time in practice to read the circuit than to run the 
algorithm. 
Abstract 

Retiming is a technique for optimizing sequential circuits. It repositions the registers in a circuit 
leaving the combinational cells untouched. The objective of retiming is to find a circuit with the 
minimum number of registers for a specified clock period. More than ten years have elapsed since 
Leiserson and Saxe first presented a theoretical formulation to solve this problem for single-clock 
edge-triggered sequential circuits. Their proposed algorithms have polynomial complexity; how- 
ever, naive implementations of these algorithms exhibit $O(n^3)$ time complexity and $O(n^2)$ space 
complexity when applied to digital circuits with $n$ combinational cells. This renders retiming 
ineffective for circuits with more than 500 combinational cells. This paper addresses the imple- 
mentation issues required to exploit the sparsity of circuit graphs to allow min-period retiming 
and constrained min-area retiming to be applied to circuits with as many as 10,000 combinational 
cells. We believe this is the first paper to address these issues and the first to report retiming 
results for large circuits. 



1. Introduction 

Retiming is a sequential logic optimization technique. Leiserson and Saxe 
provided the first formulation and theoretical solution to this problem in 1983 [4] 
although their later paper [5] has the most complete overview of this work. Re- 
timing uses the flexibility provided by repositioning memory elements to opti- 
mize a circuit for one of several objective functions: 

1. min-period: minimize the clock period of the circuit 

2. min-area: minimize the number of registers in the circuit 

3. constrained min-area: minimize the number of registers in the circuit 
subject to a maximum constraint on the clock period, or indicate failure if 
the target period cannot be achieved. 

As a means of motivating and introducing the concept of retiming, we present a 
simple example in Figure 1. Assume that each gate has the delay shown inside 
it. The solid rectangles represent edge-triggered registers. A single clock is 
used to drive the clock pins of registers. The best clock period for such a circuit 
(neglecting clock skew and set-up time of registers) is given by the maximum 
delay of a path consisting of gates. The clock period for the circuit shown in 
Figure 1 is 6 units. An equivalent circuit with 3 registers and a clock period of 
4 units can be obtained by repositioning registers as shown in Figure 2. This 
circuit has the minimum number of registers. On the other hand, the minimum 
period achievable by moving registers is 2 units at a cost of 4 registers as shown 
in Figure 3. Thus a simple re-configuration of memory elements yields designs 
with differing area costs (number of registers) and performance (clock period). 
It is this trade-off that we are interested in investigating. 

Figure 1. A simple circuit. 

For digital circuit design, the only interesting objective function is con- 
strained minimum area retiming. However, the minimum period retiming prob- 
lem remains important as a step in solving the constrained min-area problem. 
This is because the min-period problem is computationally less intensive and it 
provides a lower-bound for the best delay achievable by the constrained min- 
area problem. For these reasons, we address both the min-period and con- 
strained min-area optimization problems in this paper. 




Figure 2. Retiming for minimum registers. 



2. Previous Work 

For the case of circuits with edge-triggered memory elements (registers) 
clocked by the same signal, solutions to all three problems are described by 
Leiserson and Saxe [5]. Without taking anything away from their significant 
contribution, we mention simply that the implementation details necessary for 
large sparse circuits were not reported as part of their work. Ishii, Leiserson, 
and Papaefthymiou [2] extend the concepts to handle level-sensitive memory 
elements. Lockyear and Ebeling [6] present an alternative approach to retiming 
circuits with level-sensitive memory elements. However, both of these papers 
address the theoretical issues involved and not implementation or efficiency 
details. Papaefthymiou and Randall report experimental results for level-sensitive 
memory elements [8], but the largest circuit they handle has 379 gates. Munzner 
and Hemme [7] propose a heuristic algorithm for constrained min-area retiming 
to convert a combinational circuit into a pipeline. Even though retiming can be 
used for the same effect, they justify the use of a heuristic algorithm by stating 
that the retiming algorithms cannot handle circuits with more than 400 gates. 

Figure 3. Retiming for minimum period. 

Although theoretical solutions to edge-triggered flip-flop retiming and related 
problems have been presented in the literature, very few papers have reported 
experimental results using retiming. To the best of our knowledge, experimental 
results for the constrained min-area retiming problem have not been reported. 
The only reported results for min-period retiming we have found are for small 
circuits. We believe the primary reason for this is that, although the algorithms 
are polynomial in the circuit size, naive implementations suffer the worst-case 
cost (O(n^3) time and O(n^2) space) for all circuits. 

3. Definitions 

A sequential circuit is an interconnection of logic gates and memory elements. 
A sequential circuit can be represented by a directed graph G(V,E), 
where each vertex v corresponds to a gate. Each directed edge e_uv represents a 
flow of signal from the output of gate u at its source to the input of gate v at its 
sink. Each edge has a weight w(e_uv) which indicates the number of registers that 
the signal at the output of gate u must propagate through before it is available at 
the input of gate v. Each vertex v has a constant delay d(v). If there is an edge 
from vertex u to vertex v, u is called a fanin of v and v is called a fanout of u. 
The set of fanouts (fanins) of u is denoted by FO(u) (FI(u)). A special vertex 
called the host vertex is introduced in the graph with edges directed from the 
host vertex to all vertices that represent primary inputs and edges directed from 
all vertices representing primary outputs to the host vertex. 

A retiming is a labeling of the vertices r : V -> Z, where Z is the set of integers. 
The weight of an edge e_uv after retiming is denoted by w_r(e_uv) and is given by 

$$w_r(e_{uv}) = r(v) + w(e_{uv}) - r(u) \tag{1}$$

The retiming label r(v), for a vertex v, represents the number of registers moved 
from its output towards its inputs. A path p is defined as a sequence of alternating 
vertices and edges, such that each successive vertex is a fanout of the 
previous vertex and the intermediate edge is directed from the former to the 
latter. A path can start and end at vertices only. The existence of a path from 
vertex u to vertex v is represented as u ⇝ v. The weight of a path w(p) 
is the sum of the edge weights for the path. The delay of a path d(p) is the 
sum of the delays of the vertices on the path. A 0-weight path p is a path with 
w(p) = 0. The clock period c is determined by the following equation: 

$$c = \max_{p \,:\, w(p) = 0} \{ d(p) \} \tag{2}$$

We briefly summarize the results obtained by Leiserson and Saxe [5]. An 
important concept in the retiming algorithms is the definition of the W matrix 
and the D matrix. They are defined as: 

$$W(u,v) = \min_{p \,:\, u \leadsto v} \{ w(p) \} \tag{3}$$

$$D(u,v) = \max_{p \,:\, u \leadsto v \text{ and } w(p) = W(u,v)} \{ d(p) \} \tag{4}$$

The matrices are defined for all pairs of vertices (u, v) such that v is reachable 
from u by a sequence of edges and the path does not include the host vertex. 
W(u, v) determines the minimum latency, in clock cycles, for data flowing from 
u to v and D(u, v) gives the maximum delay from u to v for the minimum latency. 
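
For small graphs, the W and D matrices can be evaluated directly with an all-pairs shortest path computation over ordered pairs. The sketch below is a dense Floyd-Warshall variant written purely for illustration; it is not the implementation used in this paper. It assumes every cycle carries at least one register, ignores the host vertex, and uses hypothetical names (`wd_matrices`, edges given as `(u, v, w)` triples). Each edge carries the pair (w(e_uv), -d(u)), pairs are compared lexicographically, and the delay of the final vertex is added back at the end.

```python
def wd_matrices(n, edges, delay):
    """Dense illustration of the W and D matrices (hypothetical helper).

    n      -- number of vertices, labelled 0..n-1
    edges  -- list of (u, v, w) with w = number of registers on edge u->v
    delay  -- delay[v] = d(v)
    Returns (W, D); unreachable pairs hold None.
    """
    INF = float("inf")
    # dist[u][v] = (registers, -(delay of all vertices on the path except v))
    dist = [[(INF, 0)] * n for _ in range(n)]
    for u in range(n):
        dist[u][u] = (0, 0)  # empty path
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], (w, -delay[u]))

    for k in range(n):
        for i in range(n):
            a = dist[i][k]
            if a[0] == INF:
                continue
            for j in range(n):
                b = dist[k][j]
                if b[0] == INF:
                    continue
                cand = (a[0] + b[0], a[1] + b[1])
                # Lexicographic order: fewest registers first, then largest delay.
                if cand < dist[i][j]:
                    dist[i][j] = cand

    W = [[None] * n for _ in range(n)]
    D = [[None] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            regs, neg_delay = dist[u][v]
            if regs != INF:
                W[u][v] = regs
                D[u][v] = delay[v] - neg_delay  # add back the last vertex
    return W, D
```

Both matrices are dense even when the circuit graph is sparse, which is the memory problem taken up later in Section 4.2.1.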

3.1 Minimum period retiming 

The objective is to obtain a circuit with the minimum clock period without 
any consideration to the area penalty due to an increase in the number of regis- 
ters. The retiming constraints for a target clock cycle c translate into two sets of 
inequalities: 

1 Non-negativity of edge weights after retiming requires w_r(e_uv) ≥ 0, ∀ e_uv ∈ E, i.e. 

$$r(v) - r(u) \ge -w(e_{uv}), \quad \forall e_{uv} \in E \tag{5}$$

2 Correct clocking at a clock period c requires that every path u ⇝ v 
with D(u, v) > c has at least one register on it after retiming, i.e. 

$$r(v) - r(u) \ge -W(u,v) + 1, \quad \forall u \leadsto v \text{ with } D(u,v) > c \tag{6}$$

The sets of constraints from Equations 5 and 6 can be solved using the Bellman- 
Ford relaxation technique developed for the “shortest path on a graph” prob- 
lem [3]. Leiserson and Saxe introduce three algorithms to solve the min-period 
retiming problem; we describe the most efficient one known as the relaxation 
method. Let Δ(v) denote the largest delay seen along any combinational path 
that terminates at the output of v. 

$$\Delta(v) = d(v) + \max_{u \in FI(v) \text{ and } w(e_{uv}) = 0} \{ \Delta(u) \} \tag{7}$$

It can be shown that the clock period c is given by the expression 

$$c = \max_{v \in V} \{ \Delta(v) \} \tag{8}$$

The relaxation algorithm has the following O(|V||E|) subroutine which determines 
if a retiming exists for a given clock period c. We refer the reader to [9] 
for a proof of correctness of this algorithm. 
min-period relaxation algorithm { 
    For each v ∈ V set r(v) = 0 
    For |V| - 1 times { 
        Compute retimed edge weights (Equation 1) 
        Compute Δ(v), for all v ∈ V (Equation 7) 
        For all v ∈ V such that Δ(v) > c, do r(v)++ 
    } 
    Compute retimed edge weights (Equation 1) 
    If max_{v ∈ V} {Δ(v)} > c, then no feasible retiming, 
    else the current r yields a legal retiming. 
} 
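
The relaxation subroutine above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the host vertex is omitted, every cycle is assumed to carry at least one register, Δ(v) is computed by a topological sweep over the zero-weight edges of the retimed graph, and the names `arrival_times` and `feasible_retiming` are hypothetical.

```python
from collections import deque


def arrival_times(n, edges, delay, r):
    """Delta(v) of Equation 7 on the graph retimed by r.

    Returns None if the zero-weight edges form a cycle (illegal circuit).
    """
    succ = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v, w in edges:
        if w + r[v] - r[u] == 0:  # retimed weight, Equation 1
            succ[u].append(v)
            indeg[v] += 1
    delta = list(delay)
    queue = deque(v for v in range(n) if indeg[v] == 0)
    seen = 0
    while queue:  # Kahn-style topological sweep
        u = queue.popleft()
        seen += 1
        for v in succ[u]:
            delta[v] = max(delta[v], delta[u] + delay[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return delta if seen == n else None


def feasible_retiming(n, edges, delay, c):
    """Return retiming labels achieving clock period c, or None if none is found."""
    r = [0] * n
    for _ in range(n - 1):
        delta = arrival_times(n, edges, delay, r)
        if delta is None:
            return None
        for v in range(n):
            if delta[v] > c:
                r[v] += 1
    delta = arrival_times(n, edges, delay, r)
    if delta is None or max(delta) > c:
        return None
    return r
```

On a three-gate ring with unit delays and two registers on the closing edge, `feasible_retiming(3, [(0, 1, 0), (1, 2, 0), (2, 0, 2)], [1, 1, 1], 2)` moves one register in front of the last gate, while a target period of 1 is rejected only after all |V| - 1 sweeps have run.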



3.2 Constrained min-area retiming 

Under the assumption that all registers have the same area, the min-area re- 
timing problem reduces to seeking a solution with the minimum number of reg- 
isters. Constraints for retiming to be valid are the same as in Equations 5 and 6. 
The formulation for constrained min-area retiming is: 

$$\min \sum_{v \in V} \left( |FI(v)| - |FO(v)| \right) r(v)$$

subject to 

$$r(v) - r(u) \ge -w(e_{uv}), \quad \forall e_{uv} \in E$$

$$r(v) - r(u) \ge -W(u,v) + 1, \quad \forall u, v \text{ with } D(u,v) > c$$

These are linear constraints and the objective function is also a linear function of 
the retiming variables, so linear programming techniques can be used to solve 
this problem. Leiserson et al. [5] indicate that the dual of this problem is an 
instance of minimum cost circulation on a graph for which efficient algorithms 
exist. They also indicate that an initial feasible solution can be obtained directly 
from the problem. 

We will not review the construction of the minimum cost circulation problem 
(details may be found in [9]), except to note that the retiming graph is augmented 
with dummy vertices, dummy edges, and capacity edges to transform it to the 
graph on which the minimum cost circulation is solved. The edges in this graph 
that originate due to the edge weights are termed “circuit” edges as there is 
one edge for every edge in the original circuit. The edges in this graph that 
come from the clock period constraints are called “period” edges. Note that the 
number of period edges can be very large and destroy the sparsity of the graph; 
we will deal with this issue in the next section. 

4. Implementation issues 

Our goal is to handle circuits with up to 10,000 combinational cells. However, 
we expect circuit graphs to be sparse, i.e. |E| = k|V| for small k. For min-period 
retiming, the bottleneck is the requirement that we iterate O(|V|) times 
to prove that there is no feasible retiming. For constrained min-area retiming, 
the bottleneck is that even when the graphs are sparse, the W and D matrices 
are dense. Further, the number of clock period edges which are implied by the 
retiming equations indicate that the retiming graph augmented with clock period 
edges will be dense. We demonstrate in this section how to solve each of these 
problems. 



4.1 Minimum period retiming 

Let us focus on the relaxation algorithm to determine if c is a feasible clock 
period. It is empirically observed that if c is feasible then the retiming labels 
converge rapidly before completing |V| - 1 iterations. On the other hand, one 
cannot determine that a clock period c is infeasible until all |V| — 1 iterations 
have been exhausted and the retiming labels have failed to converge. Thus any 
hope of speeding up min-period retiming must focus on detecting if a clock 
period is infeasible before completing the requisite |V| - 1 iterations if possible. 
This is the principal motivation for this section. 

The Bellman-Ford algorithm solves the following problem. Given a directed 
graph G(V,E) with arbitrary edge weights f : E -> R (R is the set of reals), and 
vertices with an initial distance marked on them, find the shortest distance (measured 
by sum of edge weights) to every vertex that respects the initial distance 
marking. The algorithm can be described as follows (where r(v) now denotes 
the distance marking to a vertex v): 

Bellman-Ford algorithm { 
    For each v ∈ V, r(v) = known original distance if one is marked, 
                           +∞ otherwise. 
    Loop |V| - 1 times { 
        For each edge e_uv { 
            r(v) = min(r(v), r(u) + f(e_uv)) 
        } 
    } 
    For each edge e_uv { 
        If r(v) > r(u) + f(e_uv) then FAIL (negative cycle) 
    } 
} 



The graph is permitted to have negative edge weights and hence can have 
negative cycles. The presence of a negative cycle makes the shortest distance to 
any vertex on that cycle undefined. In the presence of a negative cycle, Bellman- 
Ford must report failure to converge. 

We can abort the iteration at any point if we discover a negative cycle in the 
graph. Let us call such a negative cycle a certificate of infeasibility. We present 
a technique inspired by Szymanski who used a similar approach to compute 
lower bounds for clock periods [10]. The predecessor heuristic maintains a predecessor 
vertex pointer, denoted by pred(), with each vertex. The Bellman-Ford 
algorithm starts with all pred() pointers set to be empty. Every time vertex v 
has its distance decreased, the fanin node that caused the change is stored as 
pred(v); i.e., in the Bellman-Ford algorithm, if during the relaxation of edge 
e_uv we discover that r(v) > r(u) + f(e_uv), then we set pred(v) = u. Thus at 
every instant of the iteration, we have a sub-graph of the original graph maintained 
by the predecessor edges with at most |V| edges and |V| vertices. 

Each vertex v has a predecessor graph associated with it, defined by repeated 
traversing of the pred() pointers, starting at v and ending when either the predecessor 
is empty or a cycle is found. At every iteration the predecessor graphs 
of all vertices are examined to see if a negative cycle exists. If v is marked, 
its predecessor graph has been examined during an earlier traversal. A cycle is 
detected by checking if a vertex has already been visited during the walk started 
from the current v. The traversal is stopped whenever a cycle is detected, a 
marked vertex is visited, or the end of the predecessor chain has been reached. 
The complexity of the traversal algorithm is O(|V|) and is dominated by the O(|E|) 
relaxation of the edges. The traversal mechanism is outlined below: 
cycle detection using predecessor heuristic { 
    For each v ∈ V, mark(v) = 0 
    For each v ∈ V, cycle(v) = 0 
    cycle_count = 0 
    For each v ∈ V { 
        if (!mark(v)) { 
            cycle_count++ 
            u = v 
            while (u != NIL) { 
                if (mark(u) && cycle(u) == cycle_count) declare cycle exists and exit 
                if (mark(u)) no cycle in this sub-tree, break out of inner loop 
                mark(u) = 1 
                cycle(u) = cycle_count 
                u = pred(u) 
            } 
        } 
    } 
} 
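
As a rough illustration of how the predecessor heuristic fits into Bellman-Ford, the following Python sketch combines the relaxation loop with the traversal outlined above. It is a minimal sketch, not the authors' implementation: the function names are hypothetical, edges are `(u, v, f)` triples, and a single stamp array plays the role of both mark() and cycle().

```python
import math


def find_pred_cycle(pred):
    """Return a vertex lying on a cycle of the predecessor graph, or None."""
    n = len(pred)
    stamp = [0] * n  # 0 = unmarked, otherwise the id of the walk that marked it
    walk = 0
    for start in range(n):
        if stamp[start]:
            continue
        walk += 1
        u = start
        while u is not None:
            if stamp[u] == walk:
                return u  # revisited during the current walk: cycle
            if stamp[u]:
                break  # examined in an earlier walk, no cycle here
            stamp[u] = walk
            u = pred[u]
    return None


def bellman_ford_with_pred(n, edges, dist):
    """Bellman-Ford with early exit on a predecessor cycle.

    dist -- initial distance marking (math.inf where unknown)
    Returns (distances, None) on convergence, or (None, cycle) where cycle
    lists the vertices of the certificate (empty if failure was only seen
    in the final check).
    """
    dist = list(dist)
    pred = [None] * n
    for _ in range(n - 1):
        changed = False
        for u, v, f in edges:
            if dist[u] + f < dist[v]:
                dist[v] = dist[u] + f
                pred[v] = u  # remember which fanin caused the decrease
                changed = True
        if not changed:
            return dist, None
        u = find_pred_cycle(pred)
        if u is not None:
            cycle, x = [u], pred[u]
            while x != u:
                cycle.append(x)
                x = pred[x]
            return None, cycle
    for u, v, f in edges:
        if dist[u] + f < dist[v]:
            return None, []
    return dist, None
```

The traversal costs one pass over the vertices per iteration, so it does not change the asymptotic cost of a relaxation sweep.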

Let us now extend this analogy to the min-period retiming problem. When 
Δ(v) is computed for each vertex v, we store with v a vertex u such that there 
exists a 0-weight path p : u ⇝ v and Δ(v) = d(p). If Δ(v) > c, then we 
set pred(v) = u. Consider a cycle in the predecessor sub-graph that includes 
vertices u_1, ..., u_k = u_0, i.e. pred(u_i) = u_{i-1}, i = 1, ..., k. Let p_{i-1,i} denote 
the path u_{i-1} ⇝ u_i which is used in the computation of Δ(u_i). During the 
iterations, retiming labels increase only by 1. Recall that before the labels are 
updated, w_r(p_{i-1,i}) = 0. After the update, 

$$w_r(p_{i-1,i}) = r(u_i) - r(u_{i-1}) + w(p_{i-1,i}) \le 1$$

because the weight was 0 before the update and each label grows by at most 1. 
In addition d(p_{i-1,i}) > c. Thus for the cycle 

$$\sum_{i=1}^{k} d(p_{i-1,i}) > kc \ge c \sum_{i=1}^{k} w_r(p_{i-1,i}) = c \sum_{i=1}^{k} w(p_{i-1,i})$$

The term on the left hand side represents the total delay encountered as a 
signal traverses the cycle. The term on the right hand side is the total time 
available for the signal to propagate under the current clock period. Retiming 
cannot change the number of registers in a cycle of a circuit (see Lemma 1 
in [5]). This implies that for a clock period c, this cycle will prevent any feasible 
retiming. Retiming can only exist for a clock period given by the inequality 

$$c \ge \frac{\sum_{i=1}^{k} d(p_{i-1,i})}{\sum_{i=1}^{k} w(p_{i-1,i})} \tag{9}$$

The implementation of this technique results in dramatic speed-up in execution 
time. Not only can infeasibility be detected early, but the cycle that caused 
infeasibility provides a new lower bound for the clock period. This can be used 
to bias the binary search effectively. The memory overhead consists of a pointer 
per vertex and an extra field used for detecting the cycle. 
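
One way to picture the biased binary search is sketched below. This is an illustrative assumption about how the bound might be used, not a description of the authors' code: `check` is a hypothetical feasibility oracle that returns a pair `(feasible, bound)`, where `bound` is the lower bound derived from the offending cycle (total cycle delay divided by the number of registers on the cycle) or `None` when no cycle was extracted.

```python
from bisect import bisect_left


def cycle_lower_bound(path_delays, path_weights):
    """Lower bound on the clock period implied by one cycle."""
    return sum(path_delays) / sum(path_weights)


def min_period_search(candidates, check):
    """Binary search over sorted candidate periods, biased by cycle bounds.

    candidates -- sorted list of candidate clock periods
    check      -- check(c) -> (feasible, bound); bound may be None
    Returns the smallest feasible candidate, or None.
    """
    lo, hi = 0, len(candidates) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        feasible, bound = check(candidates[mid])
        if feasible:
            best = candidates[mid]
            hi = mid - 1
        else:
            lo = mid + 1
            if bound is not None:
                # Skip every candidate below the bound: all are infeasible.
                lo = max(lo, bisect_left(candidates, bound))
    return best
```

With eight candidate periods and a true minimum of 6, an oracle that reports a bound of 5.5 on its first failure lets the search finish in three feasibility checks instead of four.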

4.1.1 Experimental results. All experiments are run on a Sol- 
bourne Series 5e. We select some circuits from the ISCAS89 suite chosen so 
that they reflect the variation in size of this suite. 

We compare an implementation of the retiming algorithm without predecessor 
pointers to an implementation that uses them in Table 1. The two implementations 
are identical except for the part that traverses the predecessor pointers. We 
see a substantial reduction in execution time for the predecessor heuristic. Using 
the cycle obtained as a certificate of infeasibility to update the lower bound 
on the clock period is useful, as the bias eliminates feasibility checks at clock 
periods that are below the bound and therefore guaranteed to be infeasible. 



                      CPU (in sec.)        clock period 
name      # gates    original     new     before   after 
s1494         386        96.2     2.1       17.4    17.4 
s1423         384       101.0     3.9       36.6    31.4 
s5378         887       527.6    13.7       11.9    10.4 
s9234        1107       159.8    13.8       19.9    12.9 
s13207       1854      1973.2    28.8       23.2    21.4 
s15850       2240      2856.1    37.4       40.1    22.9 
s38584       7882     39025.9   306.2       35.5    34.1 

Table 1. Results for minimum period retiming. 



4.2 Minimum area retiming 

Minimum area retiming poses two major hurdles: 

1 computing the W and D matrices, and 

2 implementing minimum cost circulation. 

4.2.1 Computing W and D matrices. We shall not describe the 
method proposed by Leiserson et al. [5] to compute the W and D matrices, 
except to say that an algorithm with O(|V|^3) time and O(|V|^2) memory is proposed. 
The W and D matrices are required to add the clock period edges which 
in turn are required to solve the minimum cost circulation problem using cost 
scaling techniques. Further, all of the clock period edges must be added prior to 
solving the circulation problem. 

The number of clock period edges greatly increases the density of the original 
graph. However, only a small subset of the period edges implied by Equation 6 
are required for the computation; the rest form redundant constraints. The original 
algorithm has two drawbacks: the O(|V|^2) memory and the inability to prune 
the clock period edges. We propose an algorithm which has a worse complexity 
than the original formulation, but whose memory is O(|V|) and is able to 
generate a smaller set of constraints for sparse circuit graphs. 

Equation 6 describes the conditions under which clock period edges need to 
be added to the retiming graph. In the original formulation of constrained min- 
area retiming, clock period edges are required between all vertices u and v such 
that 

$$u \leadsto v \text{ and } D(u,v) > c \tag{10}$$

To see why a smaller set of clock period edges is sufficient for constrained min-area 
retiming, note that if 

$$r(v) - r(u) \ge -W(u,v) + 1 \tag{11}$$

is true for a sub-path of a path, then it is also true for the entire path. Hence a 
period edge need only be added to vertex v, reachable from w, such that: 

$$D(w,v) > c \text{ and } D(w,u) \le c \quad \forall u \text{ lying on } w \leadsto FI(v) \tag{12}$$

Consider a vertex w. We are interested in the period edges that have their 
source as w. To do this, it is necessary to examine only a single row of the W 
and D matrices (the row W(w, .) and the row D(w, .)). The set of vertices 
can be partitioned into disjoint sets depending on the value of W(w, .) as shown 
in Figure 4. The directed edges in Figure 4 represent paths to other vertices 
from w. The dashed curve represents the set of vertices v that meet the condition 
of Equation 12. We can ignore any period edges between w and the fanouts of 
such vertices. Thus only some of the entries of W(w, .) and D(w, .) need to be 
computed. The elements of W(w, .) can be computed using Dijkstra's algorithm 
since the edge weights w(e_uv) ≥ 0, ∀ e_uv ∈ E. The computation of D(w, .) is 
complicated due to two facts: the dependence on W(w, .) and the non-monotonicity 
of the gate delays (akin to finding the longest path in a graph with positive edge 
weights). 

We now describe how to generate a single row of the W and D matrices for 
a single vertex w, and how to find only the set of vertices which satisfy 
Equation 12. An ordered pair (w(e_uv), -d(v)) is associated with each edge e_uv and 
is used to compute the shortest distance from w. Comparisons are done in 
lexicographical order. Thus for a path w ⇝ v, we obtain the distance as an 
ordered pair denoted by (a_v, b_v). Upon termination of the algorithm this pair 
corresponds to (W(w,v), D(w,v)), with the delay component stored negated. The 
algorithm to compute period edges consists of applying the following mix of 
Dijkstra's algorithm and the Bellman-Ford algorithm. The algorithm maintains a 
heap for each distinct value of a_v (the heap is indexed by this value since there 
could be several such heaps). Since w(e_uv) ≥ 0, we are guaranteed that the first 
component of the distance measure cannot decrease for all vertices in the heap 
with the lowest index. To compute b_v for all v in the smallest indexed heap, the 
Bellman-Ford algorithm is used. 

Figure 4. Disjoint partitions of the vertices. 
adding period constraint edges { 
    c = target clock period 
    S_k = the heap of vertices v with a_v = k 
    For each w ∈ V { 
        w = current vertex 
        For each v ∈ V, a_v = +∞ and b_v = +∞ 
        S_0 = {w}, a_w = 0 and b_w = -d(w) 
        k = current register weight 
        do { 
            k = min{p | S_p ≠ ∅} 
            u = pop_min(S_k) 
            if (-b_u > c) { 
                add period edge from w to u with weight a_u - 1 
            } else { 
                For each v ∈ FO(u) { 
                    if ((a_v, b_v) > (a_u + w(e_uv), b_u + (-d(v)))) { 
                        b_v = b_u + (-d(v)) 
                        a_v = a_u + w(e_uv) 
                        heap_insert(S_{a_v}, v) 
                    } 
                } 
            } 
        } while (∃p | S_p ≠ ∅) 
    } 
} 
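
To make the pruning idea concrete, here is a simplified Python sketch for a single source vertex w. It deliberately swaps the per-level heaps and Bellman-Ford sweeps described in the text for one binary heap with label correcting (a vertex is re-inserted whenever its pair improves), which is easier to show in a few lines; it should be read as an illustration of the pruning rule, not as the authors' algorithm. The name `period_edges_from` and the `fanout` adjacency format are hypothetical, and every cycle is assumed to carry at least one register.

```python
import heapq


def period_edges_from(src, n, fanout, delay, c):
    """Pruned period edges leaving src (simplified label-correcting sketch).

    fanout[u] -- list of (v, w_uv) pairs, w_uv = registers on edge u->v
    delay[v]  -- d(v)
    c         -- target clock period
    Returns {v: weight} with weight = (registers on the path found) - 1.
    """
    INF = float("inf")
    a = [INF] * n  # fewest registers found so far on a path src ~> v
    b = [0] * n    # negated delay of the best such path
    a[src], b[src] = 0, -delay[src]
    heap = [(a[src], b[src], src)]
    result = {}
    while heap:
        au, bu, u = heapq.heappop(heap)
        if (au, bu) != (a[u], b[u]):
            continue  # stale entry: u was improved after this push
        if -bu > c:
            # Path delay exceeds c: constrain src -> u and stop propagating,
            # since constraints beyond u are implied.
            result[u] = au - 1
            continue
        for v, w in fanout[u]:
            cand = (au + w, bu - delay[v])
            if cand < (a[v], b[v]):  # lexicographic comparison
                a[v], b[v] = cand
                heapq.heappush(heap, (cand[0], cand[1], v))
    return result
```

Only the heap and two label arrays are kept per source, so memory stays linear in the number of vertices, and vertices beyond a pruned vertex are never visited along that path.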

Dijkstra's algorithm for shortest paths on a graph requires that the distance 
measure be monotonic. The distance measure (as defined by the lexicographical 
order) of a vertex v is a function of the edge weights and the vertex delays. Edge 
weights are monotonic, since w(e) ≥ 0, ∀ e ∈ E and we are interested in distances 
with minimum number of edge weights. However, the vertex delays are non-monotonic; 
i.e. after popping u using pop_min(), we cannot conclude that b_u 
has attained the value of D(w,u). To handle this, a Bellman-Ford relaxation is 
carried out for each value of k (the minimum index amongst the set of heaps). 

Consider the analysis for a given w. Let the index k increase from 0 to K. Let 
V_k be the set of vertices for which W(w,v) = k. Let E_k be the set of edges with 
0 edge weights with either sources or sinks in V_k. Due to the non-monotonicity 
of vertex delays, we are forced to execute a Bellman-Ford set of relaxations 
for each k. Hence for a given k, there will be at most |V_k||E_k| heap queries, 
each of which requires log|V_k| time, yielding a complexity of |V_k||E_k| log|V_k|. 
Since we are guaranteed that every cycle has at least one register, distances 
cannot be increased arbitrarily by traversing cycles. Thus the algorithm requires 
O(Σ_k |V_k||E_k| log|V_k|) time. Note that the vertex sets V_k (and edge sets E_k) are 
disjoint and hence we can bound this by O(|V||Ê| log|V̂|), where V̂ and Ê are 
the maximum sized sets amongst V_k and E_k (among possible values of k). Since 
this is repeated for each vertex the worst case bound is O(|V|^2|E| log|V|), 
considerably worse than a Floyd-Warshall (O(|V|^3)). There are two benefits to 
using this algorithm. First the memory overhead is a set of heaps whose total 
size can be kept to O(|V|) with some book-keeping. Secondly, the execution 
time rarely displays the behavior predicted by worst case analysis. On sparse 
graphs, the term |V̂| rarely reaches its upper bound of |V| because not 
all vertices are reachable from w, and the pruning of distance propagation once 
the delay of a path becomes larger than c effectively restricts the set of vertices 
examined. This pruning strategy speeds up the convergence of the algorithm. 

In practice, the term |Ê| log|V̂| is much smaller than |E| log|V|. The final 
proof of this algorithm is in the implementation. For some of the large circuits 
used in the experiments, it is almost impossible to finish computing the W and D 
matrices in a reasonable amount of time using a Floyd-Warshall implementation. 
With the above algorithm, the execution time is comparable to minimum cost 
circulation computation. 

4.2.2 Minimum cost circulation using cost-scaling. A disadvantage 
of cost-scaling techniques is that the graph cannot change during 
the computations; consequently all the period edges must be determined before 
starting the cost-scaling algorithm. Our implementation is based on the generalized 
cost-scaling framework of Goldberg and Tarjan [1] and has a complexity 
of O(|V||E| log|V| log(|V|C)) where C is the largest cost in the graph, i.e. one 
more than the number of registers in the circuit. If |V| is 10,000, and we restrict 
ourselves to 32-bit integer arithmetic, C must be less than 200,000; since C is 
the number of registers in the circuit, this is a reasonable assumption. The algorithm 
operates by maintaining an error from optimality and successively halving 
it. At the start this error is |V|C. However if an initial flow is known, the value 
of C can be reduced to be the minimum value that has to be added to the cost 
of each edge so that the graph does not have a negative cost cycle. This fact 
reduces the value of C by an order of magnitude. An initial flow for the circulation 
graph can be constructed easily. For most circuits C turns out to be less than 10, 
effectively making the algorithm independent of C. 
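
One plausible reading of the 200,000 figure, assuming the initial error |V|C has to fit in a signed 32-bit integer, is the following arithmetic: 

$$|V|\,C < 2^{31} \;\Rightarrow\; C < \frac{2^{31}}{10^{4}} = \frac{2147483648}{10000} \approx 214{,}748$$

which the text rounds down to 200,000. 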

As an aside, one has to be careful with the edge capacities. In general these 
are real numbers and can cause convergence problems. For sake of stability, the 
edge capacities are scaled to integers using a factor which is a function of the 
least common multiple of the number of fanouts of vertices in the graph. 

4.2.3 Experimental results. The experimental setting for min-area 
retiming consists of the following steps. A min-period retiming is carried out to 
determine a bound on the best achievable clock period. The number of registers 
in the min-period solution is examined. A min-area retiming is performed on 
this circuit with the clock period set to the best achievable clock period. This 
yields an area optimal solution without any sacrifice in performance. 

The results for constrained min-area retiming are compared to the results for 
min-period retiming in Table 2. The first set of circuits consists of the ISCAS 
circuits used for min-period retiming. The last 3 circuits (the second set) are 
multipliers pipelined by placing three registers in series at the outputs in each 
case (see Section 5 for details). As can be seen, the min-period solution can be 
very far away from the area optimal solution at the same clock period. 

                         # registers 
name      # gates   min-period   min-area at    CPU for 
                                 min-period     min-area 
s1494         386            1             1         58 
s1423         384           81            78         84 
s5378         887          296           169        300 
s9234        1107          328           241        641 
s13207       1854          548           481       2467 
s15850       2240          655           529       4668 
s38584       7882         1433          1429     139328 
8x8           542          179           102         38 
16x16        3030          710           183       2459 
32x32        ≈13k         2763           403      67155 

Table 2. Results for minimum area retiming. 



5. Tracing area-delay curves 

One motivation for constrained min-area retiming algorithms is to examine 
the area-delay trade-off. This section presents a set of experiments on two 
multipliers: an 8x8 multiplier with 16 outputs and a 16x16 multiplier with 32 outputs. 
We choose multipliers because they are common circuits found in many datapaths 
and signal processing designs. Each multiplier has 3 registers in series at 
each output. The area and delay characteristics for the initial circuits are 
summarized in Table 3 (columns 2-4). The clock period is the maximum delay from 
an input to an output, since the registers are all placed at the outputs. As the 
latency of each output is 3, we expect that min-period retiming will partition 
the circuit into 4 regions that have almost the same delay. Thus the value of the 
clock period at a min-period solution is expected to be roughly a fourth of the 
original clock period (column 5). 



                                    clock period 
name     # gates   registers   original   min-period 
8x8          542          48      27.85         7.53 
16x16       3030          96      58.82        15.92 

Table 3. Properties of pipelined multipliers. 



The next experiment concerns the area-delay trade-off that is possible by 
changing the target clock period. The lower bound on the clock period is computed 
using the min-period retiming algorithm. The original clock period is 
an upper bound. Four equidistant points are picked between the upper and lower 
bounds as target clock periods for each circuit. The minimum number of registers 
required for each clock period is computed (Table 4). An effective area-delay 
trade-off is seen. The CPU variation with the clock period is also shown in 
Table 4. This variation can be ascribed to the changing number of edges in the 
graph: for a given circuit, as the target clock period is varied, the number of 
period edges varies considerably (see Table 5). 
At clock periods close to the lower bound, the period edges dominate all the 
other edges in E. The number of period edges steadily increases as the target 
clock period is lowered and then decreases. This can be explained as follows. 
As the clock period decreases, the number of paths that need to be constrained 
increases. This is the general trend. Recall the pruning strategy used in the 
computation of period edges that ignores any path that extends from a sub-path with 
delay greater than the clock period. This has little effect at large clock periods, 
since most reachable vertices have delays less than the clock period. But when 
the clock period decreases, this strategy results in a decrease of period edges. 
Note that in all the cases, the graph remains sparse even after the period edges 
are added. 
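
The target clock periods used in this sweep can be generated as in the short sketch below. It assumes, consistently with the six columns k = 0..5 in Tables 4 and 5, that the targets are T_min + k * t_step with t_step = (T_max - T_min) / 5; the function name `target_periods` is hypothetical.

```python
def target_periods(t_min, t_max, interior_points=4):
    """Evenly spaced target clock periods from t_min to t_max inclusive."""
    steps = interior_points + 1          # number of intervals
    t_step = (t_max - t_min) / steps
    return [t_min + k * t_step for k in range(steps + 1)]
```

For the 8x8 multiplier of Table 3 (T_min = 7.53, T_max = 27.85) this gives a step of about 4.064 between successive targets.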



name       k=0     k=1     k=2     k=3     k=4     k=5 

# registers at clock = T_min + k t_step 
8x8        102      69      63      54      52      48 
16x16      183     139     126     117     104      96 

CPU time at clock = T_min + k t_step 
8x8       33.8    31.0    31.4    37.9      36    36.1 
16x16     2459    4264    2325    2860    3397    2840 

Table 4. Variation of register count with clock period. 



                  |E|/|V| at clock = T_min + k t_step 
name     |V|      k=0     k=1     k=2     k=3     k=4     k=5 
8x8      799      8.6     9.7     6.7     4.1     3.3     3.1 
16x16   3944     25.5    38.5    26.5     7.3     3.7     3.2 

Table 5. Sparsity of retiming graphs of pipelined multipliers. 



6. Conclusions 

The practicality of constrained min-area retiming (min-area retiming subject 
to a target clock period) for large circuits has been demonstrated. The issues that 
need to be addressed for a successful implementation require careful analysis 
and a good understanding of software techniques and computer algorithms. We 
hope to have highlighted some of these issues in this technical discussion. 

The predecessor technique to detect infeasibility of a clock period has been 
demonstrated to be very effective for min-period retiming. 

Two key contributions are made for the constrained min-area retiming problem. 
First, we have shown that finding essential period edges can be done efficiently. 
Secondly, using the information provided by an initial flow is critical 
for convergence of the cost-scaling algorithm. 

The area-delay trade-off demonstrates a big win for the constrained min-area 
algorithm over the min-period algorithm, although this is hardly surprising. 
Although both algorithms have been known for several years, the experimental 
evidence has been lacking, especially for large circuits. The min-area algorithm 
enables a designer to control the area-delay trade-off of retiming in a precise 
manner. 

Finally, we would like to dispel the notion that the existence of a polynomial- 
time algorithm implies that the techniques can be applied to large circuits; we 
must have near-linear complexity to handle entire circuits in one piece. 
A CADENCE PERSPECTIVE ON ICCAD



Louis K. Scheffer 
Abstract

ICCAD is the premier technical conference for CAD tools for ICs, and Cadence is the largest 
commercial vendor of such tools. Therefore it is not surprising that the two have many connec- 
tions. Research from Cadence has been presented almost every year at ICCAD, and Cadence 
contributes heavily to other facets of the conference such as tutorials and panels. There is by 
no means a uni-directional flow of information, though - considerable research first reported at 
ICCAD has made its way into commercial products sold by Cadence. This paper covers some of 
the contributions to and from Cadence in the areas of timing, circuit simulation, layout analysis, 
physical design, logic verification, synthesis, and place and route. 



1. Introduction 

In a fast moving field such as CAD for IC design, conferences play a cru- 
cial role in the dissemination of results. They are available to a much wider 
audience than internal technical reports, and appear much sooner than the ref- 
ereed journals. Conferences inspire follow-on research through reports of work 
in progress and opportunities for informal discussion. ICCAD, as the premier 
technical conference in the field, has seen the introduction of a number of great 
ideas. This volume is a tribute to these ideas. 

As the largest vendor of CAD tools, Cadence has long had strong connections
to ICCAD. This includes presenting internal research, helping with the confer- 
ence itself, and using ideas brought forth by others. This paper describes both 
how Cadence has contributed to ICCAD and how ideas from the conference 
then translated into commercial practice at Cadence. Since the contributions to 
ICCAD are largely public (and voluminous), they are only summarized. The in- 
fluence of ICCAD on Cadence products is less readily available, and constitutes 
the bulk of this paper. 

2. Cadence Contributions to ICCAD 

Almost every year, papers from Cadence appear at ICCAD, and Cadence 
employees serve on the technical and organizing committees, appear on panels, 
serve as session chairs and otherwise support the conference. For example, at 
the 2001 ICCAD alone, Cadence employees

• Authored or co-authored eight papers
• Chaired two conference committees
• Served on the technical program committee (3 people)
• Presented or co-presented two tutorials
• Organized a panel
• Moderated six sessions

There were also many other papers by professors who work closely with
Cadence.

Clearly space does not allow listing all the Cadence research that has been 
reported at ICCAD, but three best papers selected for this book deserve special
mention. The papers “Modeling the Driving-Point Characteristic of Resistive
Interconnect for Accurate Delay Estimation” by O’Brien and Savarino (see
page 393), “A Method for Correct by Construction Latency Insensitive Design”
by Carloni, McMillan, Saldanha, and Sangiovanni-Vincentelli (see page 143),
and “Grasp - A New Search Algorithm for Satisfiability” by Silva and Sakallah
(see page 73) were all groundbreaking research done at Cadence. In addition,
many authors of academic ICCAD papers have continued their work at Cadence, 
as demonstrated by “Nonlinear Circuit Simulation in the Frequency Domain” by 
Kundert (see page 383). Finally, many of the academic authors of ICCAD pa- 
pers have strong ties to Cadence. Alberto Sangiovanni-Vincentelli, for example, 
is an author of several papers in this volume and also serves as Chief Technical 
Advisor to Cadence. 

3. ICCAD Influence on Cadence 

Many techniques now embedded in commercial products were introduced, or 
refined, in ICCAD papers. Again, a fully detailed accounting would be quite 
long, so this section concentrates on the ICCAD papers selected for this book, 
and examines their influence on Cadence products. 

3.1 Timing 

In the 1980s, wire delays were short compared to cell delays, and the entire 
capacitance of the net could be used as the cell load. As processes shrank, this 
was no longer true, making delay calculations inaccurate. The paper “Modeling 
the Driving-Point Characteristic of Resistive Interconnect for Accurate Delay 
Estimation” by O’Brien and Savarino (see page 393) represented each net as a 
pi network loading the driver and separate delays to each input. This was easy 
to implement and gave much more accurate results than lumped models, and so 
became the standard for representing interconnect delays for many years. The 
results are still visible as the circuit model in RSPF (Reduced Standard Parasitic 
Format). However, as processes shrank further, this model too became
inaccurate, and was in turn replaced by more sophisticated moment-based methods
such as Prima, also introduced at ICCAD.
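
As an illustration of the pi-network load, the sketch below matches the first
three moments of a net's driving-point admittance, Y(s) = y1*s + y2*s^2 +
y3*s^3 + ..., to a pi network made of a near-end capacitor, a series resistor,
and a far-end capacitor. The function names and the example values are
hypothetical and are not taken from any Cadence tool; the per-input delays
mentioned above are not modeled here.

```python
def pi_model(y1, y2, y3):
    """Return (c_near, r, c_far) of a pi network whose driving-point
    admittance matches the moments y1, y2, y3.

    Assumes y2 < 0 < y3, as is the case for an RC net.
    """
    c_far = y2 * y2 / y3
    r = -(y3 * y3) / (y2 ** 3)
    c_near = y1 - c_far
    return c_near, r, c_far


def pi_moments(c_near, r, c_far):
    """First three admittance moments of a pi network (the inverse mapping)."""
    return (c_near + c_far, -r * c_far ** 2, r ** 2 * c_far ** 3)


# Hypothetical net: 20 fF near the driver, 100 ohm, 50 fF at the far end.
moments = pi_moments(20e-15, 100.0, 50e-15)
c_near, r, c_far = pi_model(*moments)
```

Expanding the pi network's admittance, s*c_near + s*c_far / (1 + s*r*c_far),
as a power series in s gives the three moments used above, so the round trip
recovers the original element values.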

The paper “Prima: Passive Reduced-Order Interconnect Macromodeling Al- 
gorithm” by Odabasioglu, Celik, and Pileggi (see page 433) explained how to 
generate reduced models that are guaranteed to be passive as well as stable. This 
is a huge benefit since it guarantees stability when models are composed or
combined with external sources. All the modern delay calculators in Cadence use
algorithms similar to, derived from, or inspired by this work. We also use
similar methods on multi-port extracted networks for the analysis of coupled noise
(see Shepard, et al. [1]) in our noise analysis products Pacific™ and Celtic™. 

3.2 Circuit Simulation, Analysis, and Synthesis 

Spice-level analysis of RF circuits has always presented special problems.
The paper “Nonlinear Circuit Simulation in the Frequency Domain” by Kundert
(see page 383) extended ‘Harmonic Balance’ to large circuits and embedded it
in Spice, enabling RF analyses that were previously impractical. While this
particular technique is not used in Cadence today, the idea of algorithms specialized
for the task of RF design of ICs began with this paper. This led to the current 
product Spectre-RF™. The history of this technique (and FastHenry, below) is 
covered in much more detail in the review by Kundert and White. 

As frequencies rise, inductance becomes more important in both analog 
and digital design. The paper “Efficient Techniques for Inductance Extrac- 
tion of Complex 3-D Geometries” by Kamon, Tsuk, Smithhisler and White (see 
page 403) introduced FastHenry, a true 3-D inductance solver. Within Cadence,
this approach is used directly to model RF packages, and indirectly to calibrate 
less accurate but faster approximations used in the analysis of digital circuits[2]. 

Analog synthesis has always lagged far behind digital synthesis. An early 
attempt was “Automatic Synthesis of OPAMPS on Analytic Circuit Models” by 
Koh, Sequin, and Gray (see page 313). This work would find solutions to user- 
defined analytic approximations. This was rapidly followed by “Analog Circuit 
Synthesis for Performance in OASYS” by Harjani, Rutenbar, and Carley (see
page 325), an approach that could handle non-linear designs as well. These 
techniques are reflected in the few analog synthesis tools available today, such 
as NeoLinear. 

3.3 Physical Design 

The paper “Tilos: A Posynomial Programming Approach to Transistor Sizing”
by Fishburn and Dunlop (see page 295) introduced a practical algorithm
for transistor sizing on large circuits. They showed that the problem was convex 
and hence a local optimum was a global optimum, and gave an algorithm that 
quickly converged to the local optimum. The algorithm is still in use since it is 
Pareto-optimal - all faster algorithms give worse results, and only slightly better
results come from much slower approaches. 

In the 1980s, most of the commercial placement engines relied on simulated 
annealing, such as described by Swartz and Sechen[3]. However, these meth- 
ods were too slow for increasingly large netlists. The paper “GORDIAN: A 
New Global Optimization/Rectangle Dissection Method for Cell Placement” by 
Kleinhans, Sigl, and Johannes (see page 499) was the first of the quadratic-based
approaches that produced quality results comparable to TimberWolf with better 
run times. It was capable of handling very large designs with reasonable perfor- 
mance, and was in turn the foundation of many further improvements. These in- 
clude linear wire length optimization[4], detailed placement by network flow al- 
gorithms[5], and inclusion of additional objectives[6]. Approaches derived from 
this work are included in many Cadence products, in particular QPlace™, the 
block and standard cell placer, and PKS™, the combined synthesis/placement 
product. 
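
To make the phrase "quadratic based" concrete, the sketch below minimizes the
sum of squared two-pin net lengths in one dimension with fixed pads. It shows
only the quadratic objective; it is not the GORDIAN algorithm, which also
involves partitioning steps that are omitted here. All names and the example
netlist are hypothetical.

```python
def quadratic_place(nets, fixed, movable, sweeps=200):
    """Minimize the sum of (x_i - x_j)^2 over two-pin nets, in one dimension.

    Setting the gradient to zero puts every movable cell at the mean of its
    neighbours, so repeated Gauss-Seidel sweeps approach the optimum.
    Assumes every movable cell has at least one net and is connected,
    directly or indirectly, to a fixed pad.
    """
    pos = dict(fixed)
    pos.update({cell: 0.0 for cell in movable})
    neighbours = {cell: [] for cell in movable}
    for a, b in nets:
        if a in neighbours:
            neighbours[a].append(b)
        if b in neighbours:
            neighbours[b].append(a)
    for _ in range(sweeps):
        for cell in movable:
            adjacent = neighbours[cell]
            pos[cell] = sum(pos[n] for n in adjacent) / len(adjacent)
    return pos


# Hypothetical chain: pad_l - a - b - pad_r, with pads fixed at x = 0 and x = 3.
placement = quadratic_place(
    nets=[("pad_l", "a"), ("a", "b"), ("b", "pad_r")],
    fixed={"pad_l": 0.0, "pad_r": 3.0},
    movable=["a", "b"],
)
```

For this chain the optimum spreads the two cells evenly between the pads. A
production placer would solve the same linear system with a sparse solver
rather than plain sweeps.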

The paper “Exact Zero Skew” by Tsay (see page 509) introduced the idea that 
zero skew could be obtained without an exactly balanced tree, by connecting two 
leaves, picking the zero skew tap point, and then performing this recursively up 
a hierarchy. This was much more practical than previous solutions such as H 
trees since it could handle arbitrary distributions of non-uniform flip-flops and 
still give zero skew. Furthermore, it led directly to bounded and/or associative
skew approaches that were more useful yet[7]. These ideas have made their 
way into all the Cadence products that generate clock trees, including CTGEN, 
CTPKS, and Silicon Ensemble™. 
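
The tap-point step can be sketched as follows, assuming the Elmore delay model
and a uniform wire joining the roots of two subtrees. Equating the delay from
the tap point to each subtree gives a closed-form position. The function
names, units, and numbers are illustrative assumptions, and the sketch does
not handle the case where the computed position falls outside the wire.

```python
def wire_delay(length, load, r, c):
    """Elmore delay of a wire of the given length driving a lumped load."""
    return r * length * (c * length / 2.0 + load)


def zero_skew_tap(t1, cap1, t2, cap2, length, r, c):
    """Fraction x of the wire, measured from subtree 1, where the tap goes.

    t1, t2 are the subtree delays, cap1, cap2 their total capacitances, and
    r, c the wire resistance and capacitance per unit length.
    """
    numerator = t2 - t1 + r * length * (cap2 + c * length / 2.0)
    denominator = r * length * (c * length + cap1 + cap2)
    return numerator / denominator


# Hypothetical example: two leaves with loads 50 and 100, joined by a wire
# of length 1000 with r = 0.1 and c = 0.2 per unit length.
r, c, length = 0.1, 0.2, 1000.0
x = zero_skew_tap(0.0, 50.0, 0.0, 100.0, length, r, c)
left = wire_delay(x * length, 50.0, r, c)
right = wire_delay((1.0 - x) * length, 100.0, r, c)
```

The merged subtree then has delay `left` (equal to `right`) and capacitance
cap1 + cap2 + c * length, and the same step is applied at the next level up.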

Layout extraction gives device level netlists, but many operations, such as 
noise analysis and cell characterization, wish to understand the gate level be- 
havior of a circuit. The conversion from transistor level netlists to gate netlists 
was initially performed by pattern matching, but this always suffered from in- 
complete pattern libraries. A better but more challenging approach is to derive 
the gate level behavior from the circuit. This was pioneered by Bryant in
“Extraction of Gate Level Models from Transistor Circuits by Four-Valued
Symbolic Analysis” (see page 337) and his earlier work on COSMOS. These ideas
are used in Cadence noise analysis products such as Pacific and Celtic, and the 
cell characterization included in the delay analysis tool SignalStorm™ and the
IR drop analysis tool VoltageStorm™. 

3.4 Logic Verification 

The paper “Automating the Diagnosis and the Rectification of Design Errors 
with PRIAM” by Madre, Coudert, and Billon (see page 17) was a precursor of 
the “Modified Framework...” paper below. Although the techniques of this paper 
are not used directly today, in retrospect it paved the way to more advanced and
practical applications. 

The paper “A Modified Framework for the Formal Verification of Sequential 
Circuits” by Coudert and Madre (see page 39) is the most important ICCAD
paper in the formal verification space. It is the first published work, following its
presentation at CAV’89[8], of a model checking method for state machines that 
is based on binary decision diagrams (BDDs). Using BDDs for model checking 
is largely held responsible for moving model checking from the academic do- 
main to the commercial domain. Today, it shows up in many products such as 
equivalence checkers and formal tools. By using BDDs for model checking, the 
size of problem that could be handled automatically was dramatically increased, 
to a size that made the practice commercially viable. In an historical sense, the 
descendants of this work include Cadence products such as the formal verifier 
FormalCheck™ and other products currently in development.

The paper “Dynamic Variable Ordering for Ordered Binary Decision Dia- 
grams” by Rudell (see page 51) was the next big contribution. Once BDDs had 
shown their value, they also showed an annoying sensitivity to variable order. 
Rudell showed that the order could be optimized dynamically. This work laid 
the base for all commercial model checkers, as without dynamic reordering, the 
tools would not be practical.
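
The sensitivity to variable order is easy to reproduce. The sketch below
builds a reduced ordered BDD by brute-force Shannon expansion and counts its
internal nodes for one function under two static orders. It is only a
demonstration of the effect: it does not implement dynamic reordering, and the
helper name is hypothetical.

```python
def bdd_node_count(func, order):
    """Count internal nodes of the reduced ordered BDD of func under order."""
    unique = {}  # (variable, low child, high child) -> node

    def build(level, assignment):
        if level == len(order):
            return bool(func(assignment))  # terminal node
        var = order[level]
        low = build(level + 1, {**assignment, var: False})
        high = build(level + 1, {**assignment, var: True})
        if low == high:  # redundant test: skip the node
            return low
        key = (var, low, high)
        if key not in unique:  # share isomorphic subgraphs
            unique[key] = ("node", len(unique))
        return unique[key]

    build(0, {})
    return len(unique)


def f(v):
    """x0*x1 + x2*x3 + x4*x5"""
    return (v[0] and v[1]) or (v[2] and v[3]) or (v[4] and v[5])


good = bdd_node_count(f, [0, 1, 2, 3, 4, 5])  # paired variables adjacent
bad = bdd_node_count(f, [0, 2, 4, 1, 3, 5])   # paired variables far apart
```

With the paired variables adjacent the diagram needs 6 internal nodes; with
them separated it needs 14, and the gap grows exponentially as more product
terms are added.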

The paper “Grasp - A New Search Algorithm for Satisfiability” by Silva and 
Sakallah (see page 73) was one of the early big improvements in SAT solvers, 
opening the way to many other heuristic improvements - it was one of the early 
demonstrations of the value of heuristics; today SAT solvers are ubiquitous in 
industry. In verification, today they offer an alternative to BDDs for certain 
types of model checking. 

3.5 Synthesis 

The paper “Multiple-Level Logic Optimization System” by Brayton, Det- 
jens, Krishna, Ma, McGeer, Pei, Phillips, Rudell, Segal, Wang, Yung, and 
Sangiovanni-Vincentelli (see page 191) is the basis for SIS which is the basis 
for logic optimization in the industry. 

The paper “A Method for Concurrent Decomposition and Factorization of
Boolean Expressions” by Vasudevamurthy and Rajski (see page 227) introduced
2-cube kernel extraction. It made the kernel extraction of the paper above
feasible and far more practical for complex designs. It became part of standard
SIS, and we also use it in our synthesis tools BuildGates™ and PKS. Also, the
low-power logic optimization in these products uses this 2-cube kernel extraction.

Also concerning power reduction, the paper “Hyper-LP: A Design System for 
Power Minimization using Architectural Transformations” by Chandrakasan, 
Potkonjak, Rabaey, and Brodersen (see page 117), was one of the first papers
to address power minimization, a crucial field today. Almost all power sensitive
synthesis, including the current Cadence offerings, follows from this work in 
spirit if not in detail. 

4. Summary 

ICCAD is much more than a purely academic conference. As shown in this 
paper, and similar papers from other companies, industry has contributed a great 
deal to ICCAD and received a wealth of good ideas in return. Almost all
commercial CAD products, from Cadence and others, incorporate ideas that
were first reported at ICCAD. Since almost all chips are designed using one or
more CAD tools, practically every chip built today incorporates some ideas from
the conference. Since these chips are used in almost all modern electronics, and
modern electronics are everywhere, this means that every day a good fraction of
the world’s population uses the practical fruit of ICCAD. Not many conferences 
can make that claim! 
Abstract

This article describes the impact of ICCAD on Fujitsu, along with Fujitsu’s contribution to 
ICCAD. 

1. Introduction 

Fujitsu Ltd., Fujitsu Laboratories Ltd. (FLL), Fujitsu Laboratories of America
Inc. (FLA), and other subsidiary companies related to VLSI design and
CAD development congratulate ICCAD for 20 years of innovation in design 
automation. 

CAD is the key technology for VLSI design and therefore has been one of 
Fujitsu’s focused research areas. Started in the 1970s, Fujitsu CAD research 
now covers a wide spectrum of activities including high level design, verifi- 
cation, simulation, logic synthesis, floorplanning, placement, routing, and test. 
Fujitsu Laboratories of America, Inc. (FLA), headquartered in the heart of
Silicon Valley, was established in 1993. It triggered close collaboration between
CAD researchers in the US and Japan. The CAD research group is one of the most
important departments in FLA. The research essential for system-on-a-chip in
the deep sub-micron (DSM) era is conducted, and practical tools based on these
research results are developed, both in the US and Japan.

ICCAD is the best technical conference in design automation. Fujitsu has 
been contributing to ICCAD since 1987, when our first ICCAD paper appeared. 
Our researchers have served as speakers, session chairs, moderators, organiz- 
ers, and members of technical program committees on several occasions. In 
addition to providing them with opportunities to present new research on basic 
VLSI CAD technologies, ICCAD has also been an invaluable source of ideas, 
technology, inspiration, and feedback to our research and development teams. 
This article presents the impact of ICCAD on Fujitsu CAD and also Fujitsu’s 
contribution to ICCAD. 

2. Verification 

Twenty years ago, the term “verification technology” meant simulation
technology. Some acceleration techniques such as dedicated hardware for
simulation [7] were proposed. Although not widely accepted from an academic
viewpoint, these have been providing effective verification solutions as
emulators for a long time.
In contrast, formal verification techniques were only in the realm of pure
academic research. A big turning point came in the mid 1980s with the
introduction of Binary Decision Diagrams (BDDs). The potential of BDDs was
first presented at ICCAD in 1988 [19]. The approaches on page 29 and page 65 
were consolidated into equivalence checking of two large combinational cir- 
cuits. These two seminal pieces of work form the basis of almost all existing 
equivalence checking products, including Fujitsu’s. Current commercial equiv- 
alence checking tools adopt a multi-engine approach, which uses ATPG, linear 
programming, SAT (see page 73) as well as BDDs. FLA originally developed
this multi-engine approach [25]. 

Model checking also benefited from BDDs [21]. The paper on page 39 is 
an important contribution in the area of implicit state traversal using BDDs and 
forms the basis of symbolic model checking. It was followed by a series of more 
practical applications [10, 9]. 

BDDs brought with them several interesting research problems, the most im- 
portant being that of variable (re)ordering. The BDD size depends strongly on 
the variable order. Techniques in the paper on page 51 and [16, 8] were pro- 
posed to address this problem. To handle the size explosion, we also proposed 
a new representation: Partitioned Ordered BDD or POBDD [2], along with a 
POBDD-based reachability technique for state machine traversal [1]. 

Thus the pioneering accomplishments on BDDs and model checking tech- 
nologies presented at ICCAD brought out several successful research results at 
Fujitsu. These also led to Fujitsu’s in-house equivalence checkers and model 
checkers, which are widely used today by designers in Fujitsu. 

3. Logic Synthesis 

In the mid 1980s logic synthesis focused on multi-level logic circuits. This was
made possible largely because of the powerful logic synthesis algorithms and 
framework provided by “MIS” on page 191. A researcher could quickly im- 
plement and evaluate his/her ideas in this framework. We made our in-house 
logic synthesis tool based on MIS. ICCAD provided several key logic synthesis 
technologies, which were incorporated in our synthesis tool. MIS showed that 
restructuring techniques are effective for logic minimization. The kernel extrac- 
tion technique on page 227 enabled fast and powerful logic decomposition. 

Using our logic synthesis tool, we have shown that permissible functions can 
be represented compactly with BDDs and used effectively for multi-level logic
optimization [31]. This work led to remarkable progress in the research on don’t 
cares and BDDs [32, 22]. We have also studied technology mapping for FP- 
GAs [20] and for ASICs [30]. 

In the age of deep sub-micron technology, timing issues have acquired great 
importance. [17] provided a useful framework for performance optimization.
We improved upon this framework so that it could handle large industrial
circuits [23, 33]. We also made significant contributions to the complexity theory
of several delay optimization problems [26, 27]. We showed that certain forms 
of global fanout optimization and gate resizing problems are NP-complete. We 
proposed two efficient techniques for delay optimization via gate sizing: an LP- 
based algorithm [34] and an optimum pseudo-polynomial time algorithm for 
tree circuits under different rise and fall cell delays [3]. 

4. Physical Layout 

Placement is a key step in physical layout design to reduce the chip size and 
to improve routability. GORDIAN on page 499 demonstrated that
partitioning-based placement can cope with the increasing chip size. A placement
tool based on bipartitioning, Loose and Stable Net Removal [15], was developed for
ASIC design at FLL. The papers on page 479 and page 535 proposed systematic
ways to optimize slicing-tree based floorplans and block packing, respectively.
These opened the door to realizing an automatic floorplan generator.

Routability was one of the most important objectives for routing systems in
the early 1990s. The Touch and Cross routing algorithm [18] was proposed for
better routability. A massively parallel routing hardware engine consisting of
16K processors was designed to speed up the algorithm for PCB designs. As for
ASICs, a routing tool based on Touch and Cross was implemented in the early 1990s.

Timing closure has been a hot topic for designers since the late 1990s. A
timing-driven layout project was launched by FLL and FLA in 1995. It resulted in
a new design paradigm, in which logic optimization is interleaved with place- 
ment/partitioning refinement and hierarchical global routing. This paradigm was 
integrated into Fujitsu’s ASIC design flow in 1999. As part of this research and 
development, new layout-aware algorithms for important, layout-friendly logic 
transforms such as gate resizing, net buffering, generalized De Morgan transform,
pin permutation, and gate decomposition were proposed [24, 28]. We started a 
follow-up project in 1999 to develop a more tightly integrated
performance-driven layout framework, which enabled several tools to plug in and
communicate with each other through a shared database. A common delay calculator,
logic synthesizer, timing-driven placement tool and a global router have been 
integrated into the framework. ICCAD contributed several key technologies for 
these tools. For instance, for the logic synthesizer, the two-level minimization 
paper on page 205 and the multi-level minimization papers on pages 191 & 227 
provided the foundation. 

The quality of clock tree synthesis is also a key issue for circuit performance. 
The paper on page 509 proposed the basic idea to control clock skew, which 
still forms the basis of most of the existing clock tree synthesis tools. 
5. Test

Fujitsu has been active in testing research since 1975, starting from board- 
level testing for mainframes. D-algorithm, PODEM, and FAN were improved 
to make them applicable to large LSIs and board-level designs. To handle the 
growing complexity of designs, algorithm improvement and parallel processing 
have been applied. We introduced new test schemes such as RAM Function 
Test (which carried out data extension of the template pattern for every RAM by 
ATPG) and Dynamic Function Test (which paid attention to transition faults). 
We now describe the recent research results in design for testability (DFT), se- 
quential ATPG, and high-level test generation. 

DFT is a key technology that facilitates higher fault coverage and lower test 
cost. In 1995, we introduced the novel technique of cost-free scan [4], which 
reduced the area overhead of flip-flops by establishing the scan path through 
combinational logic and replacing scan flip-flops with non-scan flip-flops,
whenever possible. LSI size is approaching 10 million gates and the performance of
LSI-Tester has become a bottleneck. Recently, we invented BAST (BIST Aided 
Scan Test), which succeeded in compressing the amount of test data (by a factor 
of 10) and the test time. We also formulated several efficient DFT schemes for 
low-overhead testing [11, 14, 13]. These techniques resulted in 20-30% savings 
in test overhead over conventional techniques. 

Sequential ATPG still remains a challenging problem. We came up with 
a technique to significantly improve the performance of diagnostic ATPG for 
sequential circuits using dynamic fault collapsing [29]. A novel method,
binary time-frame expansion, was proposed [5], in which the behavior of a
circuit for t time frames, where t < n, is modeled by unrolling the circuit
2^0, 2^1, 2^2, ... times and combining them. This method outperforms the
conventional time-frame expansion by several orders of magnitude on many
designs.

Test pattern generation at higher levels of abstraction is often more tractable 
than at the logic level because of the smaller number of primitives. Efficient RTL
ATPG techniques were proposed and integrated into an in-house tool [12], which 
is 2 to 3 orders of magnitude faster than the logic-level ATPG tools. An efficient 
methodology was proposed to generate test programs using the instruction-set 
description of a processor [6]. Experimental results show a substantial CPU 
time reduction for processor validation. 

6. Conclusion 

System-level design, verification, and ultra DSM problems have become
increasingly important. Hierarchical design methodologies for very large circuits
are necessary but not sufficient. We still have several unresolved problems in 
VLSI CAD. Fujitsu would like to continue investigating these challenging
problems and contribute to the further success of ICCAD.
1. Introduction

Over the last twenty years, ICCAD has been a major source of innovative 
ideas and valuable technical interactions for IBM, as well as a showcase for the 
many IBM advances in design automation. The quality of papers and presenta- 
tions has been unparalleled, as is exemplified by the selected papers in this book. 
As a company that depends on advanced design automation tools for its prod- 
ucts, and as an innovator in the design automation arena [7], IBM congratulates 
ICCAD on its 20th Anniversary and applauds the authors, tutorial presenters, 
and program and executive committees that have made ICCAD the best techni- 
cal conference in design automation. 

Figure 1 highlights some of the major advances in design automation over 
the last 30 years, from IBM’s point of view. Many tools originated in IBM 
to support its early use of advanced technology, while many were developed 
outside IBM and reported at ICCAD. In each of the four disciplines, we list the 
major areas of innovation on the first row, and the IBM and industry tools that 
benefited and contributed to this on the second and third row. A more detailed 
overview of this IBM perspective can be found in [7]. 

In the following pages, we look at those areas and highlight the papers that 
have had an impact on IBM. Certainly, many of the papers in this book fall into 
this category, as do many other excellent papers which have appeared in the 
proceedings over the years. 

2. Verification 

Verification gates the delivery of IBM’s products. It is of overwhelming im- 
portance and consumes the largest portion of our product development effort. 

The use of equivalence checking is an important complement to simulation. 
Since the late 1970’s, the ability to prove the equivalence of RTL design and 
lower-level implementations has allowed IBM to use efficient register-transfer- 
level simulation instead of much more costly gate-level simulation. Although 
equivalence checking in IBM [2] predated ICCAD, the use of reduced ordered 
binary decision diagrams (BDDs) has supplanted our earlier methods. Variable
ordering is especially important in forming usable BDDs and the Rudell pa- 
per [30] was a fundamental work in describing efficient and effective methods 
for good variable orderings. 
Figure 1. 30 years of algorithm and tool development.



The definition of “cut points” in a logic network [33] was key to practical use 
on large industrial designs, since it allowed computationally expensive equiva- 
lence problems to be reduced to smaller ones. 

Checking equivalence between RTL and transistor-level netlists has been
especially valuable for custom microprocessor designs [3]. Bryant [36] used
four-valued symbolic analysis to perform such an extraction. This work, together
with related efforts, has made formal equivalence checking a major contribution 
to industry that has helped contain the cost of verification. 

Model checking has also become a valuable adjunct to simulation. IBM’s 
RuleBase system [12] started with the work of McMillan [11] and has developed 
into a robust system with satisfiability solvers such as GRASP [32]. It is now 
frequently used on large complex designs to identify subtle errors early in the 
design process. 

3. Logic Design, Timing and Test 

In the 1960’s and 1970’s, transistor and gate networks were designed “by 
hand” and then simulated to confirm functional and timing correctness. The 
convergence of logic synthesis, equivalence checking and static timing analysis,
first used by IBM in the early 1980’s [5], [2], [4], eliminated much tedious work 
and opened the door for dramatic designer productivity gains. 

One of the most influential logic synthesis systems continues to be MIS [19], 
the multilevel optimization system developed at IBM and U. C. Berkeley. Fun- 
damental innovations in optimization methods such as weak division, technol- 
ogy mapping and rectangle covering for factoring came out of MIS and provided 
a “gold standard” for optimization work by a generation of researchers and com- 
mercial tool developers. 

While MIS used algebraic optimization methods, the first production logic 
synthesis system, LSS [5], relied more on “local transformations” to generate 
production-quality designs of product chips with manageable runtimes. IBM 
followed LSS with BooleDozer [8], which incorporated ideas from MIS, and 
added many other innovations, such as an extended form of the global flow 
analysis techniques [21] that were pioneered in LSS. In another extension,
Brand [31] used a test-based approach to verification that focused on
determining where changes have occurred, allowing the unchanged part of the design
to remain stable. This was especially helpful for handling engineering changes 
in the synthesis process. Still further advances were based on the work of Ra- 
jski [22], which combined the decomposition and factorization steps into a sin- 
gle algorithm, and Watanabe [23], which did the same with decomposition and 
technology mapping. 

Retiming can effectively optimize logic by moving latches in a circuit to bal- 
ance timing and minimize cycle time. Leiserson and Saxe [6] established the 
basic principles for retiming, and Shenoy and Rudell [34] made it practical and 
efficient to apply to large-scale networks. In IBM, we have achieved signifi- 
cant design optimization using this method, but it has not been widely adopted 
because it interferes with our verification methodology. 

In the 1970’s, IBM pioneered the development of PLAs and supporting tools 
such as MINI [13]. Since then there has been much research in this important 
area [20]. PLA circuits and tools are still used in some high-performance de- 
signs. 

Overall, logic synthesis has certainly had the greatest impact in the electronics 
industry. It has allowed raising the level of design above the gate level and 
enabled much higher designer productivity. We all look forward to the next 
advance to the system level. 

4. Physical Design 

The Gordian system [26] sparked research in the area of cell-placement for 
very large chips. It introduced the principle of treating all cells simultaneously 
throughout placement. Algorithms based on this principle have been very
beneficial for large-scale ASIC designs in IBM. Min-cut algorithms are being used
extensively for iterative improvements. Yang and Wong [29] show that network 
flow algorithms can be efficiently applied to large-scale partitioning problems. 

In more hierarchical designs, approaches based on rectangle packing have 
gained importance. The sequence-pair representation [27] is fundamental to 
most of these approaches. Simulated Annealing was developed in IBM and 
has been widely applied to improve designs incrementally and make them meet 
complex optimization criteria [10]. As an example, annealing is used in a
hierarchical design-planning system for early floorplanning [25]. Clock-tree
generation is an integral step in physical design today. Many of IBM’s clock- 
tree generation tools are based on the zero-skew principles laid out by Tsay [24]. 

The improved capacity of physical design tools has allowed increasingly 
larger designs to be created. Physical verification tool capabilities need to be 
kept in line with these larger design sizes. Goalie [28] introduced a fundamen- 
tal region analysis algorithm and “union-find” data structures that have allowed 
many of the operations intrinsic to design-rule checking and circuit extraction 
to be carried out very efficiently. 
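
As an illustration of why union-find structures suit circuit extraction, the following minimal sketch merges touching rectangles on a single layer into electrically connected nets. It is not Goalie's algorithm: the rectangle format, the function names and the all-pairs overlap test are assumptions made for brevity, and a production tool would use a scanline or region-based analysis instead of comparing every pair.

```python
class UnionFind:
    """Disjoint-set forest with path halving and union by rank."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

def touches(a, b):
    """Rectangles are (x1, y1, x2, y2); True if they overlap or abut."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def extract_nets(rects):
    """Group rectangle indices into connected nets."""
    uf = UnionFind(len(rects))
    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if touches(rects[i], rects[j]):
                uf.union(i, j)
    nets = {}
    for i in range(len(rects)):
        nets.setdefault(uf.find(i), []).append(i)
    return sorted(nets.values())
```

Each `find` and `union` runs in nearly constant amortized time, which is what makes connectivity bookkeeping cheap once the geometric neighbors are known.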

5. Circuit Design 

Automated circuit tuning is extremely important for IBM’s microprocessor 
design groups. The posynomial approach to transistor sizing [35] marked the 
beginning of the automated circuit tuning era. Simplified delay models enabled 
the use of fast algorithms to solve the sizing problem. JiffyTune [37] allowed for 
circuit tuning in the presence of complex delay models. JiffyTune is a dynamic 
circuit tuning tool and has been used on many custom circuits in IBM. One of 
the key elements is the fast circuit simulator, SPECS [38], which efficiently
provides time-domain sensitivities. To use dynamic tuning, it is necessary to 
specify all input patterns which tuning is to consider. As a result, most IBM 
designers have moved to static circuit tuning, which is based on the same sen- 
sitivity calculation engine, SPECS, and a static timing analyzer EinsTLT [43]. 
The progress from manual circuit design to automated transistor-level tuning is 
truly remarkable and a major productivity improvement. 

6. Analysis 

Analysis tools are essential in any design project to guide designer decisions 
and to provide the foundation of later automation. Over the last 20 years there 
has been an increasing focus on interconnect analysis. IBM’s early use of high- 
performance packaging for its servers required considerable pioneering work in 
interconnect modeling, including inductance [9]. Today, many of these techniques
have been adapted for chip designs.

O’Brien and Savarino [18] were among the first to recognize the need for reduced
models of interconnect networks and developed their Pi models for specific
networks. The AWE work of Rohrer and Pileggi [14] provided a more
formal and general framework that has seen many significant enhancements, 
including the PRIMA work [17]. These model-reduction methods are incorporated
in IBM’s timing and noise analysis tools. The “FastHenry” work of
Kamon et al. [16] served as the performance benchmark for our extraction
tools. 

While design in IBM is primarily digital, the development of SiGe technol- 
ogy has led to a growth in RF and analog design and the need for effective tools. 
Fortunately, many of the ICCAD advances in this area, such as Kundert’s har- 
monic balance work [15] have found their way into commercial tools that we 
use internally and support for our customers. 

7. System Design 

System design in IBM typically refers to the creation of very large multipro- 
cessor complexes with shared memory and banks of peripherals. IBM’s EDS 
was the first design system to attempt to extend automation beyond chips to 
packages, cards, boards and cables [7]. Today, complex systems are being de- 
veloped on single chips and the industry needs a new set of system-level tools 
to help designers make critical tradeoffs early in the design process. 

The IMEC work on DSP compilers called attention to the importance of op- 
timizing software for emerging DSPs and greatly influenced commercial DSP 
compilers that are essential for today’s embedded systems. Jacome [42] gave an 
excellent example of an optimization algorithm that can be developed given the 
appropriate framework for considering tradeoffs.

As “systems on a chip” become more complex and chips become larger, still 
more automation will be required to manage system design. The work by Carloni
et al. [41] described a correct-by-construction method that could help in
organizing such chips in the future. 

Power dissipation has emerged as a critical concern for designers and power 
optimization is needed at all stages of design. The Hyper-LP work [39] provided 
an early focus on power and described a broad set of architectural and logic
transformations for reducing power. These transformations have been widely 
used and rediscovered in later tools. Malik et al. [40] introduced the notion
of “power cost of software” and provided a simple, but important, model for 
optimization and considering design tradeoffs. 

8. Conclusions 

ICCAD has served as an important beacon of future EDA advances for 20 
years. Many critical ideas were first presented there. But there are many challenges
ahead and more breakthroughs are needed. We look forward to further
advances being reported at future ICCAD conferences.

MAGMA AND ICCAD



Michel Berkelaar 
Magma Design Automation 

1. Introduction 

Magma Design Automation, founded in 1997, sells a complete IC design 
system (from RTL to GDSII in one tool) under the name Blast Fusion^*^. This 
tool contains RTL HDL (VHDL or Verilog) entry with RTL synthesis and data 
path generation, floorplanning, netlist level optimization, placement and routing, 
clock optimization, as well as a multitude of analysis engines. Among these, the
most important are an incremental static timing analyzer, parasitic extraction,
crosstalk noise analysis, power analysis, and rail (voltage drop) analysis.

This tool is based on two fundamental concepts: The Magma Unified Data 
Model and the FixedTiming® Technology. The Magma Unified Data Model
concept means that Magma’s design flow consists of a single executable, which 
stores all design information in one in-memory database. As a result, all design 
steps have constant access to all data. Synthesis and optimization steps have 
access to the placement and routing information, for example. There is no need 
for data exchange through external files during the entire design flow, which 
also means that no data is lost. This allows a design flow in which the design is 
incrementally refined, and in which traditionally sequential operations (such as 
logic synthesis and placement) can be interleaved. 

The FixedTiming Technology is fundamental to the way in which the timing 
of a CMOS circuit is calculated during a large part of the design flow in Magma’s 
tools. It uses concepts from research known as constant delay and logical effort, 
on which fundamental papers have been published in ICCAD in the past. Please 
refer to the section on this topic below. 

2. Why is ICCAD important to Magma? 

ICCAD is important to Magma for a number of reasons. First and foremost, it 
clearly attracts the best papers written by academic and industrial researchers on 
the subject of Computer Aided Design for Integrated Circuits. There are other 
conferences on this topic, but ICCAD always stands out for the high technical 
quality of the program. These papers often contain new ideas that can inspire 
industrial researchers. If they turn out to be practical in an industrial setting, 
they can make their way into industrial tools. In general, however, academic 
solutions need to be heavily adapted, as real-world circuits are much bigger 
and more complex in terms of structure (dozens of (generated) clocks, false- 
path and multi-cycle constraints, etc.) than those used in the results section of 
most papers. For this reason, many papers that show great results on existing
academic benchmark sets turn out to be completely impractical for the design of
real ICs. 

The ICCAD meeting itself, always strategically placed in the heart of Silicon
Valley, gathers a large number of the important researchers and industrial
representatives in the field every year, making it the prime meeting place
to stay in touch with colleagues from all over the world. The world’s most
promising Ph.D. students give presentations at ICCAD, making it also a very 
attractive (and intensively used) hunting ground for new employees. 

3. Papers important for the algorithms inside Magma’s tools

In the following sections, we will look at which basic ideas and associated 
papers or books have been used in Magma’s tool suite. In all cases, one should 
realize that the actual implementations are heavily engineered to make them 
efficient and robust enough, as well as deliver the right level of quality. 

Many basic ideas predate ICCAD, so the most fundamental reference is not 
an ICCAD reference. In other cases, fundamental ideas were first published in 
other conferences, journals or books, but they are still mentioned here. In those 
cases, often ICCAD papers (sometimes many) do exist that take these ideas 
further. 



4. FixedTiming Technology 

For timing analysis and optimization during much of the flow, Magma uses
the concepts of constant delay and logical effort as developed in [12] and [21]. 
These ideas are fundamental to the efficiency and accuracy of Magma’s timing 
optimization solution, and allow the optimization of huge circuits flat. If not for 
the inspiration derived from these ideas, Magma might not have been founded
as a company. 

5. Formal Verification 

Magma does not market a formal verification tool, but an internal equiva- 
lence checker implementation exists for Quality Assurance checking. For an 
EDA company, this is a very important task. EDA customers do not take verification
errors lightly. The most fundamental inspirations for our implementation
are [2, 3, 17, 4]. A good overview of this subject can be found in [15].

6. RTL synthesis 

For optimizations at the RTL level, the introduction of Data Flow Graphs or 
DFGs as an input-language independent intermediate representation that combines
data and control flow [10] was very important. This concept is applied in
many RTL front-end tools, and Magma’s front end is no exception. 

7. Timing Analysis 

For timing analysis during optimization, Magma’s tools rely on the funda- 
mental concept of static timing analysis. [14] was probably the first paper to 
introduce this idea, and a good overview of this topic can be found in [16]. 
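
The core idea of static timing analysis can be sketched in a few lines: arrival times are propagated through the gate-level graph in topological order, so no input patterns are needed. The sketch below is a textbook simplification and not Magma's incremental analyzer; the netlist, the single fixed delay per gate and the function names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def arrival_times(fanin, delay, input_arrival=0.0):
    """Latest arrival time at every node.

    fanin maps each node to the list of nodes that drive it; nodes that
    only appear as drivers are treated as primary inputs.
    """
    at = {}
    for node in TopologicalSorter(fanin).static_order():
        preds = fanin.get(node, [])
        start = max((at[p] for p in preds), default=input_arrival)
        at[node] = start + delay.get(node, 0.0)
    return at

def slacks(at, required):
    """Slack = required time - arrival time; negative slack is a violation."""
    return {node: required[node] - at[node] for node in required}

# Hypothetical netlist: two gates feeding an output gate.
fanin = {"g1": ["a", "b"], "g2": ["b", "c"], "out": ["g1", "g2"]}
delay = {"g1": 2.0, "g2": 3.0, "out": 1.0}
at = arrival_times(fanin, delay)
print(at["out"], slacks(at, {"out": 5.0}))  # prints "4.0 {'out': 1.0}"
```

Because each node is visited once, the cost grows linearly with circuit size, which is what makes the approach usable inside an optimization loop.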

8. Logic Synthesis 

For logic synthesis, the basic idea of optimization by algebraic transforma- 
tions on Boolean expressions as developed by researchers at IBM and later in 
the University of California, Berkeley was fundamental to efficient and effective 
implementations. At the University of California, Berkeley, the famous MIS 
package was built using these ideas (see page 191 of this book), which served as 
a reference implementation that inspired many industrial logic synthesis tools. 
A more comprehensive description of the ideas used in MIS can be found in [5]. 

9. Clock tree synthesis 

For automatic synthesis of clock trees two basic principles are used in 
Magma’s flow. The first is the construction of a clock tree with minimal skew, 
of which the implementation is based on H-trees as introduced in [7]. We also 
use the more advanced notion of useful skew, where non-zero clock skew is in- 
troduced on purpose to optimize delay. Our implementation was inspired by 
[8].



10. BDDs 

Binary Decision Diagrams, better known under the acronym BDDs, form a 
basic data structure to store Boolean functions. Their use can be found in several 
places in Magma’s tools. Formal verification and technology mapping are two 
good examples. Both the first papers to employ BDDs to represent Boolean
functions [6] and the idea of efficient dynamic variable ordering to control
their size (paper on page 51 of this book) were fundamental.

11. Place and Route 

Over the past decades, ICCAD has been a very useful catalyst for the de- 
velopment of back-end (Place and Route) technology. The nature of back-end 
tools makes it hard to single out individual papers that have changed the back- 
end landscape. Good place and route technology is the result of experience and 
know-how. It is much less the result of a single algorithm than of a carefully
tuned flow built around a few conventional algorithms. Magma employs several different
placement algorithms in its flow. The basic ideas from [11] are used for
initial placement. Ideas from [13] are used for incremental placement and the 
paper on page 479 of this book ([20] gives a comprehensive overview) is used 
for detailed placement. 

The most common routing algorithms are variations on Dijkstra’s vertex ex- 
pansion algorithm [9]. 
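
A minimal sketch of Dijkstra-style expansion on a routing grid is given below. It assumes a single layer, unit cost per step and a hypothetical grid encoding (0 = free, 1 = blocked); it is not Magma's router, and practical routers typically add costs for vias, congestion and preferred directions.

```python
import heapq

def route(grid, src, dst):
    """Shortest path between two cells; returns a list of cells or None."""
    rows, cols = len(grid), len(grid[0])
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == dst:
            path = [cell]
            while cell in prev:  # walk back to the source
                cell = prev[cell]
                path.append(cell)
            return path[::-1]
        if d > dist[cell]:
            continue  # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    return None

# Hypothetical 3x3 grid with a two-cell blockage in the middle row.
grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(route(grid, (0, 0), (2, 0)))
```

Replacing the constant step cost with a per-cell or per-edge cost is the usual way such an expansion is adapted to different routing objectives.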

Placement and routing is not possible without the efficient storage and re- 
trieval of enormous amounts of 2-dimensional objects, and KD-trees are a fun- 
damental data structure used for this. They were introduced by Jon Bentley in 
[1].
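
To illustrate how a KD-tree supports retrieval of 2-dimensional objects, the sketch below builds a tree over points and answers a rectangular range query. It is a simplified, static version with hypothetical function names; layout tools store rectangles and support updates, which this example does not attempt.

```python
def build(points, depth=0):
    """Build a 2-d tree as nested tuples (point, left, right)."""
    if not points:
        return None
    axis = depth % 2  # alternate between x and y
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (pts[mid], build(pts[:mid], depth + 1), build(pts[mid + 1:], depth + 1))

def range_query(node, lo, hi, depth=0, out=None):
    """Collect all points p with lo[i] <= p[i] <= hi[i] for i in (0, 1)."""
    if out is None:
        out = []
    if node is None:
        return out
    point, left, right = node
    axis = depth % 2
    if all(lo[i] <= point[i] <= hi[i] for i in range(2)):
        out.append(point)
    if lo[axis] <= point[axis]:  # query window reaches into the left half
        range_query(left, lo, hi, depth + 1, out)
    if point[axis] <= hi[axis]:  # query window reaches into the right half
        range_query(right, lo, hi, depth + 1, out)
    return out

# Hypothetical pin locations and a query window.
tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(sorted(range_query(tree, (3, 1), (8, 5))))  # prints "[(5, 4), (7, 2), (8, 1)]"
```

Subtrees that lie entirely outside the query window are skipped, which is the source of the speedup over scanning every object.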



12. Power analysis 

For power analysis in CMOS circuits, it is very important to know the switch- 
ing activities in the circuit. [18] introduced the first practical algorithm to esti- 
mate these activities, and a very similar algorithm is one of the options a Magma 
user has to estimate these activities. 
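
The role of switching activity in power analysis can be made concrete with a small sketch. It is a common textbook simplification and not the algorithm of [18]: inputs are assumed to be statistically independent, successive clock cycles are assumed to be uncorrelated, and the gate set, capacitance, supply voltage and frequency are hypothetical values.

```python
def propagate_probability(gate, input_probs):
    """Probability that the gate output is 1, assuming independent inputs."""
    if gate == "AND":
        p = 1.0
        for q in input_probs:
            p *= q
        return p
    if gate == "OR":
        p = 1.0
        for q in input_probs:
            p *= 1.0 - q
        return 1.0 - p
    if gate == "NOT":
        (q,) = input_probs
        return 1.0 - q
    raise ValueError(f"unsupported gate: {gate}")

def switching_activity(p):
    """Expected transitions per cycle when cycles are uncorrelated: 2p(1-p)."""
    return 2.0 * p * (1.0 - p)

def dynamic_power(activity, cap_farads, vdd, freq_hz):
    """Average dynamic power 0.5 * activity * C * Vdd^2 * f, in watts."""
    return 0.5 * activity * cap_farads * vdd ** 2 * freq_hz

# Hypothetical two-input AND gate driving 10 fF at 1.8 V and 100 MHz.
p_out = propagate_probability("AND", [0.5, 0.5])
alpha = switching_activity(p_out)
print(p_out, alpha, dynamic_power(alpha, 10e-15, 1.8, 100e6))
```

The independence assumptions are what keep the estimate cheap; correlated signals, such as reconvergent fanout, are where more elaborate estimators differ.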

13. Conclusion 

From the above it is clear that ICCAD has been very important to Magma in 
many respects. We certainly hope and expect it will be the prime meeting place 
for the EDA R&D community for years to come.

DESIGNERS FACE CRITICAL
CHALLENGES AND DISCONTINUITIES 
OF ANALOG/MIXED SIGNAL DESIGN 
AND PHYSICAL VERIFICATION 



Walden C. Rhines, Chairman and Chief Executive Officer 
Mentor Graphics Corp., Wilsonville, Oregon 

Introduction: 

Mentor Graphics Corporation is recognized as a mainstay in the electronic de- 
sign automation (EDA) industry, having developed leading-edge products that 
enable the design of electronic products for more than 20 years. Mentor’s break- 
through research in EDA ranges from high-speed board design to resolution en- 
hancement technologies for subwavelength manufacturability. 

Mentor’s strategic focus in analog/mixed-signal and physical verification 
and resolution enhancement technologies for system-on-chip (SoC) designs has 
been positively influenced by the advanced research conducted by contributors 
to ICCAD, as well as our experience with the real-time, real-work problems 
faced by IC designers in today’s world of complex engineering. 

This editorial, authored by Walden C. Rhines, chairman and chief executive
officer of Mentor Graphics Corp., Wilsonville, Ore., highlights examples of the
most critical challenges and discontinuities that have been addressed at ICCAD
and that are currently challenging IC designers and EDA vendors alike.

Challenges and Discontinuities in Analog/Mixed Sig- 
nal Design 

The digital chip has dominated industries from automotive to aerospace. But 
the recent market push in high-bandwidth communications technologies has led 
to a sharp increase in the use of analog/mixed signal (AMS) chips. Demand 
for AMS chips is expected to grow by 25 percent in the next few years, which 
means that engineers will need to shift their technology paradigms to meet the 
challenges ahead. 

To become truly competitive in this fast-paced segment, designers will be
required to make the transition to top-down design techniques that take into 
account the needs of digital and analog designers alike from the beginning of the 
design. These behavioral modeling methodologies were the topic of a six-person 
tutorial presented at ICCAD in 1999 [1]. The growing design complexities make 
it imperative for designers to use the newest analog HDLs for top-down analysis 
while retaining the bottom-up analysis of the transistor level. The discontinuity
is that analog and digital designers both need to move rapidly to adopt new
methods in order to accurately create and test new designs. 

For example, many engineers are trying to implement analog features in low- 
cost digital CMOS chips. Typically, analog and digital subsystems are created 
separately, do not interact until IC layout and remain untested until fabrication. 
Any faulty interaction found at this point can result in expensive production 
delays and possibly in lost market opportunities. Fortunately, there are now 
tools available that support behavior modeling and standard analog modeling 
languages as well as mixed signal verification. Key solutions filling the gap in- 
clude behavioral model libraries, analog HDL languages and design simulators. 

Behavioral libraries mimic the behavior of a device and can be implemented 
at several levels of abstraction. Examples range from a simple op amp to a 
complex multipole zero op amp. Each model also offers dozens of parameters to 
enable virtually unlimited customization. If a model is not available to describe 
a proprietary design, a designer can use an analog HDL such as VHDL-AMS to 
create new models or write custom code. The development of an HDL reverse 
engineering tool-set is also a possibility for design data reuse [2]. 

The current trend toward joint partnerships and purchased intellectual prop- 
erty makes it important to use a simulator that accepts all standard HDLs, in- 
cluding Verilog, Verilog-AMS, VHDL, VHDL-AMS, SPICE and C-level mod- 
els. Language-independent simulators allow designers to reuse major portions 
of the analog or digital test bench in the full-chip verification. The models cre- 
ated for this design and refined for the verification are also ideal starting models 
for the next generation of the product. The models also improve and mature 
along with each generation. Adapting a model that is simple in design can be 
the key to quick and complete simulation in a mixed-signal simulator such as 
Mentor Graphics Eldo software [3]. 

Leading-edge companies have begun to make the transition from traditional 
analog tools, but because it is often more difficult to train people than it is to 
create new software, the industry as a whole has been slow to adopt new gen- 
eration tools. Both the learning curve and the initial capital investment can be 
a stretch. However, as the AMS market continues to grow, individual design 
engineers who have the foresight to increase their skills and the businesses that 
have the vision to plan for the future will outperform their competitors. 

There are many strong business reasons that should compel analog and digi- 
tal designers working in the AMS space to make the switch to top-down design 
using the newest developments in HDLs. Ever-increasing chip complexity and 
time-to-market pressures continue to stretch our current SOC development tech- 
niques. System-level design, especially mixed signal design and partitioning, is 
a major cause of schedule delays. The size of today’s analog designs makes it 
impossible for an engineer to go straight from specification to transistor-level
design. Utilizing top-down design with the new analog HDLs allows designers
to increase their productivity. 

Designers facing the prospect of complex mixed signal chips should consider 
their own willingness to make radical shifts when necessary as they plan for 
the future. With continued acceptance and use of the new AMS tools, we will 
soon find that we have successfully traversed this discontinuity in analog/mixed 
signal design and moved on to the next challenge. Even better, a whole set of 
engineers will be even more indispensable to their employers because they have 
mastered another design challenge. 

Challenges and Discontinuities in Physical Verifica- 
tion 

Several years ago, design engineers were able to simulate and verify designs 
against a specific set of known realities. About four years ago, however, physical 
verification hit a major roadblock for those working on very large designs. As 
feature size continued to shrink, the requirement for more complex design rules 
grew. The tools in place at the time could not do the job. Designers who thought 
a job was almost complete were in a bind because they could not finish the 
verification. This was a discontinuity that design automation companies had to 
respond to almost without warning, and almost overnight. 

The move to deep-submicron work revolutionized everything. New factors 
had to be considered. As chips became more complex, polygon count exploded 
exponentially. This called for the creation of a fundamentally new architecture 
that was hierarchically based so that terabytes of information for physical layout 
could be compressed down to at least gigabytes. Designers had to move to 
parallel processing so that they could effectively scale and complete the task 
without using more memory in the process [4]. Further, as we delved deeper 
into the subwavelength arena at 0.18 micron and below, we began to play tricks
on modern physics.

The EDA industry responded well. Modern photolithography uses wavelengths
of light that are larger than the smallest IC feature size to define features
with the use of resolution enhancement technologies (RET), including optical 
proximity correction, phase shift mask, scattering bars and off-axis illumina- 
tion. As we move into the danger zone at 0.15 micron and below, semiconductor 
companies will combine different types of RET at different points in the process 
to extend the life of the lithography equipment as well as to ensure adequate 
yields. Although microlithography researchers around the world have long been 
addressing the issues [5], no one can yet predict where we will be in the coming 
years, although there is the likelihood of adding extreme ultraviolet and fluorine 
lasers to the RET mix.

One common misconception in the market today is that these factors will
create the need to replace the entire EDA tool suite. Several companies sug- 
gested that designers would need to replace everything from synthesis to verifi- 
cation overnight and use their new tools because timing closure could never be 
achieved otherwise. Instead, designers have chosen an evolutionary approach. 
New generations of physical synthesis tools now make a more efficient use of 
gates and use buffer insertion to fix some timing issues. These products are 
changing the way synthesis is done, throwing away 20 to 30 percent of the gates 
that are unnecessary, looking at placement and bringing in static-timing analysis. 

Physical verification proves to us that no one can truly predict the future. 
Something that looks like a complete technological dead-end at one point, such 
as subwavelength manufacturing, proves to have revolutionary solutions, while 
areas where we expect major technological transformation end up functioning 
fairly smoothly with a few small adjustments. 

A few years ago, no one could have predicted how or if we could manage 
designs below the subwavelength of light, but here we are with tremendous new 
tools that allow designers to keep breaking down barriers.

K. Wakabayashi

NEC Laboratories, NEC Corporation, Tokyo, Japan 

1. Introduction 

NEC is a premier semiconductor company with a tradition of strong internal 
technology and innovation in EDA. NEC has been an active contributor to de- 
velopments in the EDA community for more than 20 years. In particular, NEC 
researchers have contributed significantly as organizers, reviewers and program 
committee members at ICCAD. Notably, Dr. Satoshi Goto of NEC was the Pro- 
gram Chair and General Chair of ICCAD in 1990 and 1991, respectively. NEC 
has pro-actively developed in-house tools, published papers in premier EDA 
conferences and journals, and assisted EDA vendors with technology and funds 
to develop EDA tools. Like other companies with semiconductor operations, 
NEC has also been a beneficiary of innovations showcased at premier international
technical conferences like ICCAD. These innovations have consistently
fuelled semiconductor design methodologies worldwide across diverse industry
segments - whether at the system level (personal computers, hand-held devices 
etc.) or at the device level (processors, memory, logic, MEMS, RF devices etc.). 

2. System Design and Test 

System-level design: NEC is addressing the challenges of SoC design
by developing a C-based system design flow that assists system designers in
behavioral system specification and simulation, system architecture template 
definition, behavior-to-architecture (component and communication) mapping, 
system-level performance and power estimation [1, 2, 3], and automatic op- 
timization of the system architecture through tuning of architectural template 
parameters [4, 5, 6, 7, 8, 9, 10, 11]. NEC developed the first comprehensive 
system-level design methodology for on-chip communication. A fast performance
analysis technique for bus-based system-on-chip communication architectures
was published at ICCAD 1999 [12]. This was followed by a comprehensive
methodology for design of communication architectures for systems-on-chip
[13]. This work led to a Best Paper Award at the ACM/IEEE Design
Automation Conference in the year 2000. NEC also created the ACE-2 initiative
to reduce turnaround time for ASIC SoC designs by up to two-thirds, from
an average of 450 engineering months to fewer than 150 months. To realize 
the aggressive goals of ACE-2, NEC teamed with the world’s leading electronic 
design automation (EDA) companies to define a new design methodology for 
quick and accurate SoC design. We have also developed technologies that au- 
tomate custom architecture performance analysis [14, 15, 16, 17]. These tech- 
niques facilitate the use of extensible processors in SoCs. 

Advanced processing architectures: Due to limited CPU and battery re- 
sources, wireless hand-held devices are unable to provide real-time security 
processing support. The MOSES project develops a programmable, mobile 
security processing system that combines a novel hardware architecture with a 
tamper-proof, flexible software architecture to achieve security protocol acceler- 
ation that cannot be achieved by conventional cryptographic algorithm hardware 
accelerators [18, 19, 15, 20]. The code compression project CoCo compresses 
instruction code off-line, places the compressed code in the system LSI’s memory and
decompresses the code on the fly during system run-time [21, 22, 23, 24]. NEC
continues to pioneer the discovery of new architectures for on-chip commu- 
nication and network switching fabrics [27, 28, 29] that have received criti- 
cal industry acclaim (for example, EE Times featured coverage of FLEXBAR 
technique [30, 31]). NEC has also developed MP98, a high-performance and 
low-power microprocessor technology for smart information terminals [32, 33]. 
NEC has pioneered the development of a unique, dynamically re-configurable 
processing architecture that finds wide applications in multimedia process- 
ing [34, 35]. 

High-level and RTL design: NEC’s high-level hardware design flow featured
the use of a C-based hardware description language [36], well before the
recent emergence of languages such as SystemC, providing designers with clear 
advantages in terms of ease of specification and simulation time compared to 
traditional HDL-based flows. The CYBER C-based design system [37, 36] has 
been in use internally in NEC for both ASIC and re-configurable (FPGA) de- 
signs for over a decade. It all began with a ground-breaking paper published 
in the 1989 ICCAD [38]. This was followed by numerous papers [39, 40] 
that underscored NEC’s commitment and conviction to move to higher levels 
of abstraction. Subsequently, a comprehensive performance-driven HLS flow 
for control-flow intensive designs was developed [41, 44, 46, 45], which in- 
troduced several new concepts and significantly improved the state-of-the-art 
in high-level synthesis. NEC pioneered a new paradigm for energy and per- 
formance optimization. The new paradigm was based on optimizing for the 
common-case computations in a behavior [42]. This work won the Best Pa- 
per Award at the 1999 ACM/IEEE Design Automation Conference. NEC also 
recognized the fact that behavioral descriptions often consist of multiple concurrent
processes, and developed techniques for performance analysis [47] and
optimized synthesis [48] of multi-process behaviors. 

Low Power Design: NEC’s RTL power optimization and estimation techniques
were among the first attempts to address power issues at the RTL
level [49, 50, 51, 52, 42, 52, 53]. These techniques inspired the research community
to focus on RTL power-optimization techniques. NEC has continuously
driven the efforts on higher level power optimization techniques [54, 55, 56]. 

Rather than minimizing the power consumption of a single component in a system,
NEC recognized that the power consumption of a complex SOC can only
be minimized when the interdependencies of system components are taken into 
consideration [4, 5, 2, 57]. NEC also recognized the importance of hw/sw partitioning
for low power [6] and the impact of the operating system’s power consumption
[58, 59]. The shift to higher level optimization techniques continued with
optimizing the system for low power consumption, first published at ICCAD [8] 
and later refined into a suite of highly efficient methods [7, 9, 10, 11, 60].

Test and Design for Testability: NEC researchers pioneered the concept 
of scan in the 1960’s, and they continue to use it extensively. NEC’s test tool 
suite TRANGEN still enjoys superior performance over commercial EDA of- 
ferings. The tool suite is powered by innovations that were first proposed in 
a 1988 ICCAD paper that advocated use of quadratic programming and SAT- 
based techniques [73] for test generation as an alternative to the traditional 
path-sensitization based methods. This technique was championed and further 
refined [74, 75, 76, 77, 78, 79] in subsequent years by NEC researchers, culminating
in the development of the TRANGEN tool suite, which has been in wide use
at NEC for over eight years. NEC researchers proposed static [80, 81] and
dynamic [82, 83] test set compaction algorithms that continue to be effective 
for large designs. NEC’s manufacturing test road map includes investigation 
of design for testability techniques other than full-scan. Over the years, NEC 
researchers have developed partial scan [84, 85] and synthesis for testability 
techniques [86, 87, 88, 89, 90] that are in use today in the development of high- 
speed supercomputers. The partial scan work received the Best Paper Award at 
the 1994 ACM/IEEE Design Automation Conference [85]. Research in delay 
testing [91, 92, 93] has culminated in the development of an in-house delay testing
tool suite that is in use for high-speed designs. NEC researchers have used DFT 
early in the design process to ensure test coverage and to reduce test develop- 
ment time [94, 95, 96, 97, 81, 98, 93]. NEC has been actively investigating 
BIST for SOC designs, with particular emphasis on techniques with low
overhead [99, 100] and low power consumption during test application [101, 102, 103].

3. Logic and Physical Design

Seminal papers published at ICCAD allowed design abstraction to transi- 
tion from schematic entry to automated two-level/multi-level logic synthesis. 
This allowed designers to produce gate-level netlists that pushed physical de- 
sign tools to their limits. Papers on MIS, ESPRESSO-EXACT, and global-flow 
(pages 191, 205 and 217) have been the most influential and a large number of 
publications (pages 227 and 249) continued the work. Many papers at ICCAD, 
including the paper on page 235, made significant contributions in targeting FP- 
GAs. Techniques proposed in these papers are probably widely implemented in 
industry tools for FPGAs. NEC’s logic synthesis system Varchsyn was supe- 
rior to commercial EDA vendor tools for a long time. Research emphasis was 
on performance optimization of logic circuits using techniques like false-path 
elimination [61] and novel variations of retiming [62, 63, 64, 65], and logic sim- 
ulation [66, 67]. Today, NEC has adopted industry-standard logic design tools 
with limited internal research and development investments in logic design. 

ICCAD papers on physical design like the GORDIAN paper (page 499) on
quadratic programming techniques for placement and the floorplan design paper
(page 479) on application of probabilistic algorithms for physical design
have significantly influenced automation in physical design. As timing-driven
placement became important, papers like “Exact Zero Skew” on page 509
significantly advanced the state of the art. NEC invested a considerable amount of
resources in developing superior, proprietary physical design capability. Seminal
contributions to channel routing [68, 69] from the early eighties are still in
wide use today. For the past two decades, NEC has been developing leading-edge
physical design technology and contributing significantly to the placement,
global routing and floorplanning phases [70, 71]. A notable contribution on
floorplanning [72] recently won the prestigious award for the Best Paper in
IEEE Transactions on CAD, 2001.

4. Verification 

The ability to design is clearly predicated on the ability to verify. Without sig- 
nificant advances in functional verification, increasing the complexity of designs 
may become a pipe-dream. The EDA community has demonstrated in the last 
two decades that mathematical techniques can be effective in solving large-scale 
verification problems that arise in practice. 

Technologies developed for logic netlist verification and the verification of 
gate-level netlists against their RT-level specifications have almost done away 
with the need for gate-level simulation. It is our belief that this advance alone has 
saved the semiconductor industry countless dollars and many embarrassments.

Numerous papers on the development of Binary Decision Diagram (BDD) based
techniques (e.g., the dynamic BDD variable reordering paper on page 51, which
demonstrates the locality of variable order exchange in a BDD) and ATPG/SAT-based
techniques (e.g., the GRASP paper on page 73, which highlights the benefits of
conflict-analysis-based non-chronological backtracking and learning) for answering
satisfiability questions on logic expressions have of course played major roles in
this advance. What was probably more crucial in making these technologies
tractable in real life were the kinds of ideas proposed in two early papers on
pages 29 and 65, which demonstrate ways to take advantage of partitioning and
circuit similarity.

The next hurdle for the EDA community is to augment simulation with formal 
methods in the verification of temporal functionality. ICCAD in 1990 had three
key papers in the same session on the practical aspects of the application of 
symbolic methods for state space traversal, one of which is included in this 
book on page 39. These papers have led to numerous follow-on papers on BDD-
and SAT-based techniques that have significantly advanced the ability to check
temporal behavior of large designs, to the extent that today design blocks with 
about a million gates can be analyzed exhaustively for a few hundred clock 
cycles in a reasonable time at NEC. 

Deployment of formal techniques for verification continues to be a major 
strategic focus at NEC. It is NEC’s belief that access to the best verification
technology is of vital importance to NEC’s semiconductor design business. NEC
internally introduced formal equivalence checking of netlists using BDD-based
techniques long before such tools became commonly available from EDA vendors.
NEC’s BDD-based ZERO equivalence checking system, developed by
Akira Mukaiyama, was somewhat of a pioneer in that respect. As functional
verification has grown in importance, NEC researchers have explored many of
its aspects in the last decade and contributed to the state of knowledge and
application (e.g., [104], [105], [106], [107], [108], [109]). In the recent past,
researchers at NEC have developed novel world-leading technologies for analyzing
the temporal behavior of industry-scale designs that are being deployed within
NEC. It is expected that such technologies will be instrumental in significantly
bringing down the development time for complex new chips from the current
multiple years.

Abstract

We briefly discuss the influence of the International Conference on Computer-Aided Design 
(ICCAD) on the research program carried out at Philips Research over the past twenty years. 
We highlight some of the developments in the areas of simulated annealing, media-processor 
design, logic synthesis, and formal verification. 



1. Introduction 

During its 20 years of existence the International Conference on 
Computer-Aided Design (ICCAD) has provided Philips Research scientists with 
an international stage to exchange ideas and results with the research commu- 
nity. The conference not only enabled our early access to new results in the field, 
but it also gave our scientists the opportunity to present and discuss their own 
work. This two-way flow of information is characteristic of the role ICCAD 
has played within Philips Research. Over the years, about fifty Philips Research 
scientists attended ICCAD, and fifteen papers were contributed on a range of 
subjects, including mixed-signal IC design, simulated annealing, logic synthe- 
sis, high-level synthesis, device modeling, and simulation. 

The CAD effort within Philips Research over the past twenty years has been 
substantial. In the late 1970s, the early efforts on circuit simulation, layout, and 
testing resulted in the establishment of a CAD research department staffed with 
about twenty research scientists. In the early 1990s, the growth of these developments
led to the start of a self-financing activity within Philips called Electronic
Design and Tools. The role of this department is to develop CAD tools and 
act as a consulting department for the Semiconductor and Components prod- 
uct divisions within Philips. The introduction of efficient and effective tools for 
low abstraction level design tasks, such as circuit simulation and layout synthe- 
sis, shifted the CAD effort within Philips Research towards higher abstraction 
levels such as formal verification and high-level synthesis. This development 
is in line with the general trend within the domain. It has also led to the integration
of CAD research with research on embedded systems design, diminishing
the position of CAD as an independent field of research. 



Below, we present four best practices illustrating different aspects of the role 
ICCAD has played over the past twenty years within Philips Research. The 
first two discuss contributions from Philips Research to ICCAD; the other two 
elaborate on the impact of research results presented at ICCAD on CAD devel- 
opments within Philips. 

2. Simulated annealing 

Almost immediately after Kirkpatrick, Gelatt and Vecchi [6] published their 
seminal paper on simulated annealing, the CAD community discovered the ap- 
proach as a very useful tool to handle hard optimization problems in IC de- 
sign. For many problems in routing, placement, and floorplanning simulated 
annealing proved to be more effective than existing approaches. Problems in 
gate-matrix and standard-cell placement called for new effective solution meth- 
ods that could deal with the rapidly growing instance sizes that resulted from 
the ongoing miniaturization in semiconductor technology. Despite its effective- 
ness, simulated annealing suffered from the disadvantage of being very time- 
consuming, which was inherently due to the probabilistic nature of the approach. 
Consequently, many researchers took up the challenge to improve the algo- 
rithm’s efficiency by introducing advanced cooling schedules that could guaran- 
tee fast convergence to near-optimal solutions. These efforts have led to many 
improvements, which were presented at ICCAD in the mid-1980s.

The classical cooling schedule of Kirkpatrick, Gelatt and Vecchi [6] uses a 
cooling parameter, also called temperature, which controls the probability of 
accepting newly generated solutions. The lowering of this control parameter 
during the algorithm’s execution has been the subject of investigation for many 
researchers. Otten and Van Ginneken (see paper on page 479) were the first to 
propose to replace the simple geometric lowering of the control parameter by an 
advanced lowering function making use of the concept of “quasi equilibrium” 
and of concepts borrowed from statistical physics. Aarts and Van Laarhoven [1] 
used the quasi equilibrium concept to develop a schedule that converged prov- 
ably in polynomial time. Huang, Romeo and Sangiovanni-Vincentelli [4] pro- 
posed a schedule using an exponentially fast reduction of the control parameter. 
With the introduction of the extremely fast schedule by Lam and Delosme [8] a 
point in the development of ever faster cooling schedules was reached where the 
complexity of the calculations needed to estimate new control parameter values 
could no longer be compensated by the gain in speed of the algorithm’s convergence.
Consequently, the interest in cooling schedule research diminished,
and ICCAD was no longer the stage where new cooling schedule results were
presented.
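
To make the role of the control parameter concrete, the following is a minimal sketch of simulated annealing with the simple geometric lowering mentioned above. It is an illustration only, not one of the advanced schedules from [1], [4] or [8]; the toy linear-placement problem, the function names, and all parameter values (t0, alpha, steps_per_temp, t_min) are assumptions chosen for the example.

```python
import math
import random


def simulated_annealing(cost, neighbor, start, t0=10.0, alpha=0.95,
                        steps_per_temp=100, t_min=1e-3, seed=0):
    """Minimize cost() using a simple geometric cooling schedule (sketch)."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    t = t0  # control parameter, also called temperature
    while t > t_min:
        for _ in range(steps_per_temp):
            candidate = neighbor(current, rng)
            delta = cost(candidate) - current_cost
            # Improvements are always accepted; deteriorations are accepted
            # with probability exp(-delta / t), which shrinks as t is lowered.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= alpha  # geometric lowering of the control parameter
    return best, best_cost


# Hypothetical toy problem: order six cells in a row to minimize wirelength.
NETS = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]


def wirelength(order):
    pos = {cell: i for i, cell in enumerate(order)}
    return sum(abs(pos[a] - pos[b]) for a, b in NETS)


def swap_two(order, rng):
    # Neighbor move: exchange the cells at two distinct positions.
    i, j = rng.sample(range(len(order)), 2)
    new = list(order)
    new[i], new[j] = new[j], new[i]
    return tuple(new)


start = (3, 0, 5, 1, 4, 2)
best, best_cost = simulated_annealing(wirelength, swap_two, start)
print(best, best_cost)
```

The more advanced schedules discussed above replace the fixed factor alpha by a decrement computed from statistics gathered at the current value of the control parameter; that extra computation is the overhead the text refers to.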

Simulated annealing has matured into a standard optimization
method in many CAD tools. An excellent example is the TimberWolf placement
and routing package that was developed by Sechen and Sangiovanni-Vincentelli
[12]. Within Philips, several simulated-annealing-based tools have been used for
a long time to cope with problems in IC design. Currently, many tools dealing
with low abstraction level problems such as routing and placement have been
replaced with commercial layout packages, some of which still use simulated
annealing. In cases where no commercial packages are available,
simulated annealing is often the preferred optimization tool because of its
fast and flexible application in non-standard situations.

3. Media-processor design 

Over the years, Philips has invested considerable effort in the development of
tools that support the design of Digital Signal Processors (DSPs) for multi-media 
applications. The effort was based on the belief that true silicon compilation 
could be achieved in certain application-specific domains such as DSP design. 
Early work at Philips Research dates back to the design of parametric filters for 
audio processing applications in the mid 1980s. Typical of the audio applica- 
tions was the fact that requirements for both flexibility and performance could 
be easily met in a single design. In video applications this was no longer the 
case because of the extremely high computational demands imposed by the tight 
throughput requirements. This introduced the problem of trading off flexibility 
and performance. 

One approach to this problem was the VSP chip with the corresponding tool 
set introduced by Essink, Aarts, Van Dongen, Van Gerwen, Korst, and Vis- 
sers [3]. The VSP was an advanced programmable DSP consisting of several 
VLIW processing units that could run in parallel. The corresponding tool set 
provided the user with an advanced programming environment that could be 
used to (re)program the chip. The flexibility of the VSP was certainly a big ad- 
vantage, but the intricacy of the programming environment and the large number 
of VSPs that were needed for real-life video applications hampered commercial 
success. Nevertheless, one can view the VSP as a predecessor of the Philips pro- 
grammable processor generation that includes the Trimedia and Nexperia chips. 

Another way to trade off flexibility against performance was to use tools
that could substantially reduce the design time of custom ICs for video processing.
The Phideo design methodology introduced by Van Meerbergen, Lippens,
Verhaegh, and Van der Werf [11] was an example of such an approach developed
at Philips Research. The approach used a collection of design tools that
translate a high-level functional specification of a video algorithm, expressed in
the Phideo Input Format, into VHDL descriptions of a set of processing units,
memories, address generators and controllers. Two Phideo elements deserve
special attention. Firstly, the approach made use of a mathematical model
that expressed the repeated execution of tasks, which is typical of video applications,
in terms of periodic operations. Verhaegh, Lippens, Aarts, Korst,
Van der Werf, and Van Meerbergen [13] discuss the problem of finding appropriate
periodic schedules for one dimension of repetition. Verhaegh, Aarts, and
Van Gorp [14] discuss the corresponding periodic scheduling problem for
multi-dimensional periodic operations. Secondly, the approach applied improved
optimization techniques to find solutions to the well-known retiming problem
introduced by Leiserson and Saxe [10]. Van der Werf, Peek, Aarts, Van Meerbergen,
Lippens, and Verhaegh [15] discuss this problem in the general setting of
area optimization of multi-functional processing units. The Phideo approach has
proved to be a successful design methodology, and it was used to design several
industrial ICs. The best-known examples are the Melzonic, an IC for
motion-compensated field-rate upconversion by Lippens, de Loore, de Haan, Eeckhout,
Huijgen, Loning, McSweeney, Verstraelen, Pham, and Kettenis [9], and an IC for
MPEG2 video encoding by Kleihorst, van der Werf, Bruls, Verhaegh, and
Waterlander [7].



4. Logic synthesis 

The logic synthesis effort at Philips Research has many roots in research that 
was first published at ICCAD. In the late 1980s, Philips Research started sev- 
eral research projects on logic synthesis. The aim of these projects was to de- 
velop software tools that could automatically translate functional specifications 
of controllers into circuits consisting of logic gates. Most of these tools applied
language transformation techniques that generated PLA tables that were
optimized and matched onto a standard cell library. One of the first tools was
PHIFACT, introduced by Crowet, Davio, Dierieck, Durieu, Louis, and
Ykman-Couvreur [2], which was based on the techniques for multi-level logic synthesis
introduced by Brayton, Detjens, Krishna, Ma, McGeer, Pei, Philips, Rudell, Segal,
Wang, Yung, and Sangiovanni-Vincentelli (see page 191) and by Rudell and
Sangiovanni-Vincentelli (see page 205).

OMA (Optimizer and MAtcher) was another logic synthesis tool employing
concepts similar to those described by Brayton, Detjens, Krishna, Ma, McGeer,
Pei, Philips, Rudell, Segal, Wang, Yung, and Sangiovanni-Vincentelli (see page 191)
and by Rudell and Sangiovanni-Vincentelli (see page 205). The tool used ELLA as
the input language, and can be viewed as the first available ELLA synthesizer.
Subsequent tooling efforts led to the development of a Verilog synthesizer called
VSyn, and to the start of the development of a VHDL version of a logic synthesis
tool, which, however, has not been completed. Several ICs were designed using
the logic synthesis tools mentioned above. The tools proved especially successful
in the design of circuits for telecom applications. An example
is a chip that could control the digital PRXD telephone switching station.

Currently, these tools are no longer used within Philips. As in the domain
of routing and placement they have been replaced by commercially available 
software packages. This is a general trend in circuit design that holds true for 
many tools that have been developed within Philips Research. Initially, research 
groups create implementations of techniques that are of high importance for the 
company, but as soon as commercial tools become available they are replaced. 
The advantage of the commercial tools is their improved reliability and the
maintenance provided by the vendors. The role of the in-house tools is nevertheless
significant because they enable electronics companies to reduce design times of 
ICs, thus enabling fast introduction of new products. 

5. Formal verification 

Although most techniques used in formal verification date back to the early
1980s, it was only in the early 1990s that the first industrial formal verification
tools were introduced at Philips. YATC, which was developed in 1993 at Philips
Research, was one of the first working tools, and it has been widely used throughout
Philips ever since. YATC is mainly used for equivalence checking. The
objective of equivalence checking is to compare two designs, possibly at differ- 
ent levels of abstraction. Examples of such level comparisons are: RTL-to-RTL, 
RTL-to-GATE, and GATE-to-GATE. Equivalence checking provides a way to 
attain full correctness with respect to the functional equivalence of all steps 
down from the “golden” RTL model, i.e., no bugs are introduced in the syn- 
thesis process, during addition of clock trees, scan chain insertion, engineering 
change orders, etc. The YATC approach was based on techniques described by
Berman and Trevillyan (see page 29) and on work by Rudell
(see page 51). YATC was developed in close cooperation with research groups
of the Eindhoven University of Technology, who contributed to the project with 
their vast knowledge on binary decision diagrams [5]. The YATC tool was built, 
to a large extent, on the knowledge about formal verification techniques that 
was publicly available in the open literature. However, special attention was paid to
heuristics that would make the tool more suitable for relevant industrial designs. 
We believe that this task of tailoring academic knowledge to relevant industrial 
use is a role that can only be carried out with success by industrial research 
groups, because they have direct access to real-life circuit designs, whose use 
is mandatory to judge the techniques on their true merits, but which are often 
not available to academic research groups for several reasons. Below we discuss 
two remarkable success stories for formal verification within Philips. 

In 1997, circuit designers ran into a serious design problem, and it took them
several weeks of extensive back-end simulation to find its cause.
The actual formal verification of the block where the error occurred
took just 15 minutes. It turned out that the logic synthesis tool had swapped two
bits deep inside the module.
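
The idea behind such a check can be illustrated with a small sketch. It compares a "golden" description of a 2-bit adder with a gate-level version by exhaustively enumerating all input combinations, and then shows how a version with two swapped input bits is caught. This is not how YATC works: industrial tools rely on BDDs and related techniques rather than brute-force enumeration, and all function names and the example circuit are assumptions made for illustration.

```python
from itertools import product


def equivalent(f, g, n_inputs):
    """Return (True, None) if f and g agree on every input combination,
    otherwise (False, first_counterexample)."""
    for bits in product((0, 1), repeat=n_inputs):
        if f(*bits) != g(*bits):
            return False, bits
    return True, None


def golden(a1, a0, b1, b0):
    # "RTL-level" reference: add two 2-bit numbers.
    return (2 * a1 + a0) + (2 * b1 + b0)


def gate_level(a1, a0, b1, b0):
    # Ripple-carry implementation built from XOR/AND/OR operations.
    s0 = a0 ^ b0
    c0 = a0 & b0
    s1 = a1 ^ b1 ^ c0
    c1 = (a1 & b1) | (c0 & (a1 ^ b1))
    return 4 * c1 + 2 * s1 + s0


def swapped_bits(a1, a0, b1, b0):
    # Same netlist, but with the two bits of operand a exchanged.
    return gate_level(a0, a1, b1, b0)


print(equivalent(golden, gate_level, 4))
print(equivalent(golden, swapped_bits, 4))
```

Exhaustive enumeration grows exponentially with the number of inputs, which is why practical equivalence checkers depend on symbolic representations and on heuristics that exploit structural similarity between the two designs.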

The second example relates to the chip set of the Philips car navigation sys- 
tem CARIN that was developed by Philips during the late 1990s. YATC was 
used to verify the controller part of the system, and this was done with great 
success because the tool found several serious errors. The designers used YATC 
because they were convinced that automatic formal verification tools could be of 
great use in obtaining correct designs. In this respect they were the first within 
the Philips design community to put faith and trust in such automatic tools. Cur- 
rently, YATC is widely accepted and it has become part of the standard design 
flow for digital chips within Philips. 

6. Conclusion 

Computer Aided Design is an extremely challenging field for industrial re- 
search, and ICCAD has provided a splendid international discussion forum for 
more than twenty years. We can only hope that the conference will maintain its 
prime position, and that it will be able to serve this most important role for as 
long as computer aided design remains a scientific research topic. 

References 

[1] Aarts, E.H.L., and P.J.M. van Laarhoven [1985], A new polynomial-time cooling schedule,
Proc. IEEE Int. Conf. on Computer-Aided Design, Santa Clara, 206-208. 

[2] Crowet, F., M. Davio, C. Dierieck, J. Durieu, G. Louis, and C. Ykman-Couvreur [1990],
PHIFACT-a Boolean preprocessor for multi-level logic synthesis. Proceedings of the IEEE 
Int. Conference on Computer-Aided Design, Santa Clara, 506-509. 

[3] Essink, G., E. Aarts, R. van Dongen, P. van Gerwen, J. Korst, and K. Vissers [1991], Schedul- 
ing in programmable video signal processors. Proceedings of the IEEE Int. Conference on 
Computer-Aided Design, Santa Clara, 284-287. 

[4] Huang, M.D., F. Romeo, and A. Sangiovanni-Vincentelli [1986], An efficient general cool- 
ing schedule for simulated annealing. Proceedings of the IEEE International Conference on 
Computer-Aided Design, 381-384. 

[5] Eijk, C.A.J. van, and G.L.J.M. Janssen [1995], Exploiting structural similarities in a
BDD-based verification method. Proceedings TPCD 94, ed. T. Kropf and R. Kumar, vol 901 of
Lecture Notes in Computer Science, Springer Verlag.

[6] Kirkpatrick, S., C. Gelatt, and M. Vecchi [1983], Optimization by simulated annealing. Science 220.

[7] Kleihorst, R.P., A. van der Werf, W.H.A. Bruls, W.F.J. Verhaegh, and E. Waterlander [1997],
MPEG2 video encoding in consumer electronics, J. VLSI Signal Processing, vol 17, no 2-3,
pp. 241-253.

[8] Lam, J., and J.-M. Delosme [1986], Logic minimization using simulated annealing, Proc. 
IEEE Int. Conf. on Computer-Aided Design, Santa Clara, 348-351. 

[9] Lippens, P.E.R., B. de Loore, G. de Haan, P. Eeckhout, H. Huijgen, A. Loning, B. McSweeney,
M. Verstraelen, B. Pham, and J. Kettenis [1996], A video signal processor for
motion-compensated field-rate upconversion in consumer television, IEEE J. Solid-State Circuits,
vol 31, pp. 1762-1769.

[10] Leiserson, C.E., and J.B. Saxe [1991], Retiming Synchronous Circuitry, Algorithmica 6, 
5-35. 

[11] Meerbergen, J.L. van, P.E.R. Lippens, W.F.J. Verhaegh, and A. van der Werf [1995],
PHIDEO: High-level synthesis for high throughput applications. Journal of VLSI Signal
Processing, vol 9, pp. 89-104.

[12] Sechen, C., and A.L. Sangiovanni-Vincentelli [1985], The TimberWolf Placement and 
Routing package, IEEE Journal of Solid State Circuits 20, 510-522. 

[13] Verhaegh, W.F.J., P.E.R. Lippens, E.H.L. Aarts, J.H.M. Korst, A. van der Werf, and J.L. van 
Meerbergen [1992], Efficiency improvements for force-directed scheduling. Proceedings of 
the IEEE Int. Conference on Computer-Aided Design, Santa Clara, 286-291. 

[14] Verhaegh, W.F.J., E.H.L. Aarts, and P.C.N. van Gorp [1998], Period assignment in
multi-dimensional periodic scheduling. Proceedings of the IEEE Int. Conference on
Computer-Aided Design, San Jose, 585-592.

[15] Werf, A. van der, M.J.H. Peek, E.H.L. Aarts, J.L. van Meerbergen, P.E.R. Lippens, and 
W.F.J. Verhaegh [1992], Area optimization of multi-functional processing units. Proceedings 
of the IEEE Int. Conference on Computer-Aided Design, Santa Clara, 292-299.



CONTRIBUTIONS FROM THE 
“BEST OF ICCAD” TO SYNOPSYS 



Raul Camposano, Ahsan Bootehsaz, Debashis Chowdhury, 

Brent Gregory, Jim Kukula, Narendra Shenoy and Tom Williams 
Synopsys, Inc., 

Mountain View, CA, USA 



Abstract 

This paper highlights some examples of contributions from the selected ICCAD papers to Syn- 
opsys. We do not attempt to be exhaustive. Given the breadth of the topics addressed at ICCAD 
and the number of products offered by Synopsys this would be very difficult. We also try not to 
make any value judgments on the importance of a topic: in terms of (current) economic relevance,
markets speak for themselves, predicting future relevance accurately is impossible, and judgment 
on the technical relevance is subjective. So the given examples merely reflect the papers and 
products the authors are most familiar with. We hope, however, that they do illustrate what it takes
to create practical design technology (tools) starting from outstanding ideas. 

Synopsys creates leading electronic design automation (EDA) tools for the 
global electronics market. The company delivers advanced design technologies 
and solutions to developers of complex integrated circuits, electronic systems 
and systems on a chip. Synopsys also provides consulting and support services 
to simplify the overall IC design process and accelerate time to market for its 
customers. Synopsys participates vigorously in all EDA markets addressed by 
the ICCAD paper selection: analysis, system, logic, circuit, physical, functional 
verification, timing, and test. Synopsys is among the largest EDA companies 
with a total market share of over 30%. 

Logic Synthesis was Synopsys’ first area of focus and continues to be one of
our main interests. Our Design Compiler™ and Physical Compiler™ product
families compile Verilog and VHDL into optimized, technology-mapped and (in
the case of Physical Compiler™) placed netlists, and they have been widely
adopted for design. These products have incorporated many ideas from the
ICCAD paper list, e.g., [1, 2, 3, 4, 5, 6]. Boolean logic optimization and
minimization techniques such as the ones implemented by Brayton et al. [1] and
Rudell [2] became a central component of Synopsys’ logic synthesis tools starting
from the first release. These techniques led to automatically generated circuits
that could rival those generated by hand, leading to widespread adoption
of this productivity-improving technology. From the start, improved clock speed
and chip area were concerns of every designer using logic synthesis. The global
flow techniques described by Berman et al. [7] led to large improvements over
what was possible with the earlier generation of logic synthesis tools. This helped
broaden the appeal of this new technology to the most performance-sensitive
designers. Ordered BDDs are extensively used in logic synthesis (as well as many
other EDA applications) and the work by Rudell [5] was certainly a milestone
in this area. One lesser-known example is the work of Shenoy and Rudell [6],
which documented for the first time that retiming could be used on circuits of
practical interest. Until then, the concept of retiming was largely of theoretical
interest and the impact on the design community was marginal. This work
revisited the problem from the viewpoint of determining an efficient implementation
(although the theoretical complexity is slightly worse than that of the original
algorithm by Leiserson and Saxe). This effort resulted in the “Behavioral Retiming”
concept introduced in Behavioral Synthesis (Behavioral Compiler™)
and Logic Synthesis (Design Compiler™). Although designers are usually hesitant
to allow tools to change register boundaries, the benefits of this approach
(speed and area) have led to its successful application, particularly to pipelined
data path circuitry where initial state is not a concern.

Over the last five years Static Timing Analysis (STA) has become part of
the standard design flow and is used extensively for timing “sign-off” (between
the designer and the IC manufacturer). Computing the delay of interconnects
is key in STA, and O’Brien and Savarino’s ideas [8] in this area influenced
our product PrimeTime™. Another necessary step is model order reduction, which
reduces the number of delays that must be handled to a practical size; the
contributions by Odabasioglu et al. [10] and by Feldman and Freund [11] were
certainly very influential in this field. Through the late eighties and early
nineties, circuit design with latches became a popular style. It enabled designers
to “borrow” computation time from the preceding and following stages during the
active period of the latch. However, such a design style is hard for timing
verification, as paths can sneak through multiple levels of latches. Sakallah et al.
provided a nice framework for the analysis of such circuitry. The problem of timing
verification at the latch boundaries was shown to be of polynomial complexity by the
work of Szymanski and Shenoy [9]. The major restriction is that all clocks are
required to operate at the same frequency. The authors showed that it is important
to initialize the departure times at the latches to the earliest launch times
at the start of the iteration for correct results. Although the restriction on the
clock frequency makes the algorithm less relevant in the industrial environment
with multiple clock frequencies, almost all timing analysis algorithms have the
iterative scheme and initialization process presented by the authors.
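
The flavor of such an iterative scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of [9]: all latches share one clock period, each latch has a known opening time within the cycle, departure times are measured relative to that opening time, and the function names and the two-latch example are hypothetical. Departure times start at the earliest launch time (zero) and are relaxed until they stop changing.

```python
def phase_shift(open_src, open_dst, period):
    """Time from the opening of the source latch to the next opening of the
    destination latch (a full period if both open at the same instant)."""
    shift = (open_dst - open_src) % period
    return shift if shift > 0 else period


def departure_times(open_time, edges, period):
    """open_time: latch name -> opening time within the clock cycle.
    edges: (src, dst, delay) combinational paths between latches.
    Returns latch name -> departure time, or None if no fixed point is found."""
    # Initialize every latch to its earliest launch time.
    dep = {name: 0 for name in open_time}
    for _ in range(len(open_time) + 1):
        new_dep = {name: 0 for name in open_time}
        for src, dst, delay in edges:
            shift = phase_shift(open_time[src], open_time[dst], period)
            arrival = dep[src] + delay - shift
            # Data cannot leave before the latch opens, hence the floor of 0;
            # a positive value means time is borrowed from the next stage.
            new_dep[dst] = max(new_dep[dst], arrival)
        if new_dep == dep:
            return dep
        dep = new_dep
    return None  # departure times keep growing: a loop is too slow


# Two-phase example: latch "a" opens at time 0, latch "b" at time 5.
opens = {"a": 0, "b": 5}
print(departure_times(opens, [("a", "b", 7), ("b", "a", 2)], period=10))
print(departure_times(opens, [("a", "b", 7), ("b", "a", 4)], period=10))
```

In the first call, latch "b" borrows two time units and the iteration settles; in the second, the loop delay exceeds the period and no fixed point exists. A complete analysis would additionally check each arrival time against the closing edge and setup time of the receiving latch.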

Formal verification appears today in the Synopsys product line principally in
the Formality™ equivalence checking tool and in the FormalVera RTL verification
tool. The BDD technology of Rudell [5], the SAT technology of Marques
Silva and Sakallah [12], the detection and exploitation of internal equivalences
described in the paper of Berman and Trevillyan [13] and the paper of
Brand [14], and the diagnosis method of Madre, Coudert and Billon [15] are all
building blocks that a state-of-the-art industrial equivalence checker
like Formality must use to meet customer needs. FormalVera is a semi-formal
functional verification tool that aims to improve the effectiveness of RTL
verification by supporting coverage-driven test generation and formal property
verification. Some of the techniques used in FormalVera for checking functional
correctness of digital designs with respect to specifications can be traced back
to Coudert and Madre [16], along with the fundamental BDD technology of [5].

In addition, Marques Silva and Sakallah’s algorithm to solve Satisfiability [12] 
also influenced FormalVera. Beyond these products with explicit formal verifi- 
cation function, other Synopsys products in the synthesis and verification areas 
incorporate the same BDD and SAT technology to analyze, solve, and transform 
Boolean functions. 

The VCS™ Verilog simulator provides the performance, capacity and built-in
coverage metrics required for verifying multi-million-gate SoC designs. In
addition, VCS7.0 has introduced native support for assertions, test bench
features, observed coverage and a direct, fast link to C applications (DirectC,
CycleC). VCS7.0 supports Verilog, VHDL, mixed-HDL (MX) and mixed-signal
simulation for complex SoC designs. We see this trend of integrating simulation
with formal approaches as one of the key directions in functional verification
today; it has been strongly influenced by work on formal analysis and
assertion/property specification [16]. VERA®, the test bench automation tool
from Synopsys, supports and automates the generation of simulation vectors.
VERA® is based on the non-proprietary, open standard OpenVera™ hardware
verification language. VERA® uses BDD techniques [5] and BDD-based constraint
solver technology.

The papers by Strojwas et al. [17] and by Antreich and Graeb [18] provide
insight into improving the overall yield of a fabrication process and the
overall robustness of the design. In today’s fabrication processes, variations
typically exceed the cases contemplated by these papers, and collecting failure
data to help diagnostics is increasingly popular. Synopsys’ ATPG tool,
TetraMAX®, coupled with automatic test equipment or a tester, collects such
data and offers diagnostic information to isolate the problem areas in failure
analysis. TetraMAX® also includes a delay test package, which allows the user
to determine what kind of delay path test he or she requires. It uses either a
robust path delay test that covers a set of prespecified critical functional
paths or a transition delay test that targets all slow-to-rise and
slow-to-fall delay faults in the entire network. Kundu et al. [19] address
delay test, dealing with how to find a delay test for a path that is not
invalidated by transient signals caused by arbitrary circuit delays and/or
timing skews in input changes, and looking at the full coverage of such tests
relative to stuck-open and stuck-on transistors.

Finally, in the area of transistor-level design, we would like to mention two
contributions. Bryant’s techniques for extraction of gate-level models from
transistor circuits [20] certainly influenced our transistor-level static
timing analysis tool PathMill™. AMPS™, our tool to optimize circuits by
transistor sizing, offers functionality similar to that of JiffyTune,
described by Conn et al. [21].

In summary, we can cite numerous ideas published in the “Best of ICCAD” papers
that have contributed to commercial EDA tools and to our product line in
particular. ICCAD’s long tradition includes papers with ideas that were put to
work almost immediately; others were published long before their use in
commercial products, and for some their time has yet to come.
Abstract

This paper examines the unique impact of Computer-Aided Design on Xilinx and on Field 
Programmable Gate Arrays. A brief history and description of FPGA implementation software is 
given. Some specific CAD algorithms that have influenced FPGA implementation software are 
listed along with their contributions. 



1. Introduction 

Field Programmable Gate Arrays (FPGAs) have revolutionized digital design 
and are popular because of their programmability, fast time-to-market, and low 
fixed costs. Founded in 1984 to pioneer a revolutionary new technology, the 
field programmable gate array (FPGA), Xilinx fulfills more than half the world’s 
demand for FPGAs. Today, Xilinx develops, manufactures, and markets a broad 
line of advanced FPGAs and software design tools. 

As a developer of advanced custom ICs, Xilinx employs a variety of advanced
CAD tools for the design of FPGA chips. As a developer of software design tools
for FPGAs, Xilinx also provides advanced CAD tools for users to implement their
designs on the programmable fabric of FPGAs. Xilinx is, therefore, both a
consumer and a developer of advanced CAD tools, and CAD is a critical factor in
the success of Xilinx and FPGAs. Innovation has always been the centerpiece of
Xilinx’s culture, and we at Xilinx congratulate ICCAD on its 20th anniversary
as the premier conference bringing forth fresh, innovative ideas in CAD.

For the first few generations, FPGAs were developed on mature process
technologies that lagged the leading-edge technologies in performance and area
but had better yields. In the past few years, however, FPGA chips have been
implemented in leading-edge process technologies and are increasingly being
used as process drivers for them, for several reasons: they are built on
mainstream CMOS technology, they are fabricated in large volumes, and they are
highly regular and densely laid out chips. Fabricating FPGAs in leading-edge
technologies requires the use of cutting-edge EDA tools. Several of these
cutting-edge EDA tools have had their origins in ICCAD papers.

Apart from using EDA tools to design and develop FPGAs, Xilinx also develops
FPGA implementation software that lets users design their systems and program
the FPGA accordingly. In this paper, we highlight different CAD papers and
techniques that have played a significant part in the development of FPGA
implementation software and contributed to the success of Xilinx and FPGAs.



2. Impact of CAD on FPGA implementation software

When Xilinx invented FPGAs, there were no EDA tools available for users to 
implement their designs on FPGAs. Users had to manually implement their de- 
sign on an FPGA. They would bring up a graphical tool that displayed a model of 
the FPGA device. They would then proceed to configure specific logic elements 
and routing resources to implement their design. After configuring different 
parts of the FPGAs, users would then direct the tool to automatically generate 
the bitstream necessary to program the device. Even for the relatively small gate 
densities of FPGAs available at that time, this manual method was exceedingly
error-prone and tedious.

Automation came in the form of CAD tools that allowed users to enter their 
design on a standard schematic editor. The netlist generated by the schematic 
editor was processed by automatic tools that placed and routed these designs, 
and also generated the corresponding bitstreams. In the beginning, these tools 
were developed by Xilinx because the FPGA market was small compared to the 
size of the ASIC market and, consequently, there were no other EDA vendors 
developing FPGA-specific CAD tools. 

Since the first FPGA-CAD tools were developed by Xilinx and other FPGA vendors
rather than by ASIC-EDA vendors, they evolved differently from the ASIC tools,
and so did the FPGA implementation flows. As FPGAs became more popular, the
flows required to implement user designs became more sophisticated and started
to resemble ASIC flows. External EDA vendors then started supplying some CAD
tools for FPGAs. Typically, these tools were small FPGA-related modifications
of the tools that they already sold to the ASIC market. Today, however, there
are several EDA vendors that supply CAD tools designed exclusively for FPGAs.

The history of CAD tools for FPGAs is important for understanding the
evolution of FPGA-CAD algorithms. Initially, FPGA-CAD algorithms were
exclusively adaptations of standard ASIC-CAD algorithms. Over time, however,
algorithms were developed that were exclusive to FPGAs. There is now a thriving
research effort focused exclusively on FPGA-related CAD tools. Interestingly,
in this transition from adapting standard ASIC-CAD algorithms to developing
FPGA-specific CAD algorithms, the FPGA-CAD community has developed several CAD
algorithms that have not only found applications in ASIC-CAD but have become
seminal works for the general CAD community.

It is important to understand the FPGA implementation process before we can
illustrate how specific algorithms have influenced FPGAs. A typical FPGA design
flow is shown in Figure 1. The FPGA implementation flow can be divided into
three main phases: design entry, design implementation, and design
verification. The design entry phase is identical to that of the ASIC design
flow and consists of design entry tools such as HDL, schematics, etc. The
design verification phase is also similar to that of the ASIC flow and consists
of verification tools such as simulation tools. Since they are similar to
standard ASIC tools, we will not discuss these tools in this paper. This paper
discusses only the design implementation phase of the FPGA design flow.



Figure 1. FPGA Design Flow (surviving diagram label: Design Implementation)



The FPGA design implementation phase can be divided into synthesis, placement,
and routing, followed by bitstream generation. In the logic synthesis phase, an
input HDL description is synthesized and mapped into logic elements such as
lookup tables (LUTs), flip-flops, I/Os, multiplexers, etc. that are the basic
building blocks of the target FPGA architecture. The resulting netlist consists
of these logic elements connected together to implement the user design. It is
the input to the placement phase, which places these elements on the FPGA sites
that implement them. The placement of the logic elements of the design netlist
on the logic element sites of the FPGA dictates the configuration of those
sites. After all the logic elements are placed on sites of the FPGA, they are
connected together in the routing phase. Routing determines how the routing
fabric must be configured to achieve the connections that implement the design.
Together, the configuration of the logic sites and of the routing fabric
constitutes the bitstream that can be loaded on the device to implement the
user design on the FPGA.

2.1 Synthesis and Technology Mapping

Initially, synthesis and technology mapping for FPGAs were heavily influenced
by the Espresso paper by Rudell et al. on page 205 and the MIS paper by Brayton
et al. on page 191. One of the earliest algorithms for FPGA-specific technology
mapping was presented by Francis et al. in [4]. The objective of their work was
to perform technology mapping so as to minimize the number of LUTs (area
minimization). Their main contribution was the combination of decomposition
and covering in LUT-based FPGA technology mapping. They enhanced the algorithm
to do delay minimization (reducing the number of logic levels) in the Chortle-d
algorithm presented at ICCAD 1991 [5]. The FLOW-MAP algorithm by Cong et al. on
page 235 optimally solved the LUT-based FPGA technology mapping problem for
depth minimization for general Boolean networks. This was a seminal result
because it presented a polynomial-time algorithm for the FPGA-specific
(LUT-based) technology mapping problem, in contrast to the general technology
mapping problem, which is NP-hard.
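
The depth-minimization objective can be illustrated with a short sketch. It does
not implement FLOW-MAP, which finds its cuts with network flow; instead it
enumerates all K-feasible cuts exhaustively, which yields the same minimum depth
on small networks but is not polynomial in general. The function name
min_lut_depth, the dictionary-based network format, and the assumption that
every gate has at most K fanins are illustrative choices.

```python
from itertools import product


def min_lut_depth(fanins, k):
    """Minimum LUT depth of every node, by exhaustive K-feasible cut enumeration.

    fanins: dict node -> list of fanin nodes, listed in topological order;
            primary inputs have an empty fanin list.
    k:      number of LUT inputs; every gate is assumed to have <= k fanins.
    """
    cuts = {}   # node -> set of K-feasible cuts (frozensets of nodes)
    depth = {}  # node -> minimum number of LUT levels needed to produce it
    for node, ins in fanins.items():
        if not ins:  # primary input
            cuts[node] = {frozenset([node])}
            depth[node] = 0
            continue
        # A cut of this node is the union of one cut per fanin.
        merged = set()
        for combo in product(*(cuts[i] for i in ins)):
            cut = frozenset().union(*combo)
            if len(cut) <= k:
                merged.add(cut)
        # One LUT covers everything between the cut and the node.
        depth[node] = min(max(depth[u] for u in cut) + 1 for cut in merged)
        # The trivial cut {node} is offered to the fanouts of this node.
        cuts[node] = merged | {frozenset([node])}
    return depth


# Three two-input gates forming a tree over four primary inputs.
network = {
    "a": [], "b": [], "c": [], "d": [],
    "g1": ["a", "b"],
    "g2": ["c", "d"],
    "g3": ["g1", "g2"],
}
print(min_lut_depth(network, 2)["g3"], min_lut_depth(network, 4)["g3"])
```

With 2-input LUTs the tree needs two levels, while a single 4-input LUT absorbs
all three gates, which is the kind of trade-off a depth-optimal mapper exploits.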

2.2 Placement 

The placement problem for FPGAs, as mentioned earlier, consists of placing
logic elements onto the configurable logic sites on the FPGAs. The placement
problem, therefore, is very similar to the placement problem for ASICs.
Consequently, standard placement algorithms such as min-cut based approaches
[7], annealing-based approaches [8] such as TimberWolf [11], and quadratic
placement based approaches such as GORDIAN on page 499 and Ritual [12] were
employed for FPGA placement. However, due to the complex constraints inherent
in the FPGA placement problem, annealing-based approaches were the most popular
algorithms for FPGA placement. One of the earliest published works on placement
for FPGAs was that of Betz et al. [1]. They developed a tool called VPR that
did logic element packing, placement, and routing for FPGAs. VPR was used not
only as an FPGA-CAD tool but also as a tool that could evaluate FPGA
architectures.
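
A minimal sketch of annealing-based placement is given here for illustration. It
is not the algorithm of VPR or TimberWolf; the half-perimeter wirelength cost,
the geometric cooling schedule, and all names and parameter values are
assumptions chosen to keep the example short.

```python
import math
import random


def hpwl(nets, pos):
    """Half-perimeter wirelength: sum of bounding-box half-perimeters."""
    total = 0
    for net in nets:
        xs = [pos[c][0] for c in net]
        ys = [pos[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def anneal_placement(cells, nets, sites, seed=0,
                     t_start=5.0, t_end=0.01, alpha=0.9, moves_per_temp=200):
    """Place cells on discrete sites; returns (placement, cost, initial cost)."""
    rng = random.Random(seed)
    # occupant[i] is the cell on sites[i], or None for an empty site.
    occupant = list(cells) + [None] * (len(sites) - len(cells))
    rng.shuffle(occupant)

    def positions(occ):
        return {c: sites[i] for i, c in enumerate(occ) if c is not None}

    cost = hpwl(nets, positions(occupant))
    initial = cost
    best, best_cost = occupant[:], cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            # Move: exchange the contents of two sites (cell or empty).
            i, j = rng.sample(range(len(sites)), 2)
            occupant[i], occupant[j] = occupant[j], occupant[i]
            new_cost = hpwl(nets, positions(occupant))
            delta = new_cost - cost
            # Always accept improvements; accept uphill moves with
            # probability exp(-delta / t), which shrinks as t cools.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = occupant[:], cost
            else:
                occupant[i], occupant[j] = occupant[j], occupant[i]  # undo
        t *= alpha
    return positions(best), best_cost, initial


cells = ["a", "b", "c", "d"]
nets = [["a", "b"], ["b", "c"], ["c", "d"]]
sites = [(x, y) for x in range(3) for y in range(3)]
placement, cost, initial = anneal_placement(cells, nets, sites)
print(initial, "->", cost)
```

Because a move only exchanges the contents of two sites, every intermediate
state is a legal placement; FPGA-specific constraints would be added as extra
terms in the cost or as restrictions on which moves are allowed.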

2.3 Routing 

While the FPGA placement problem is very similar to the ASIC placement 
problem, FPGA routing is very different from ASIC routing. Initial FPGA rout- 
ing approaches were adaptations of basic maze routing methods [9]. The routing 
model of an FPGA is represented as a connectivity graph where the nodes of the 
graph are the routing segments while the edges are the programmable intercon- 
nection points. The FPGA routing problem is essentially one of embedding the 
netlist on the underlying router graph. This routing model is the main reason 
why FPGA routing differs from ASIC routing: in ASIC routing, a route is
expressed in terms of the underlying rectilinear grid, but in an FPGA, a route
is expressed in terms of a path in the underlying routing graph.

Brown et al. [2] developed one of the first routers for FPGAs. The 
PATHFINDER algorithm by Ebeling et al. [3] presented a novel method for 
FPGA routing that has since become the most popular FPGA routing algorithm. 
In this algorithm, nets are routed sequentially under the assumption that all
resources are available to every net. After the first iteration, a cost is
imposed on each resource according to the demand of nets for that resource.
With this new cost, another iteration is started. The costs of the resources
are monotonically increased based on the demand for them, so that nets
gradually give up congested resources they do not critically need. This
algorithm is especially suited for FPGAs since it can take a global view of the
finite routing resources on the FPGA.
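
The negotiation loop described above can be sketched as follows. This is a
simplified illustration rather than the published PATHFINDER algorithm: it
routes two-terminal nets only, ignores timing, and uses an assumed node cost of
(1 + history) * (1 + pres_fac * overuse). The graph, the net names, and the
schedule for pres_fac are hypothetical.

```python
import heapq


def shortest_path(graph, src, dst, node_cost):
    """Dijkstra over a routing graph; the cost is paid on entering a node."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            break
        for v in graph[u]:
            nd = d + node_cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    path = [dst]  # assumes dst is reachable from src
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def negotiate_routes(graph, nets, capacity=1, max_iters=20):
    """Rip up and reroute all nets until no routing node is overused."""
    occupancy = {n: 0 for n in graph}
    history = {n: 0.0 for n in graph}
    routes = {}
    pres_fac = 0.0  # first pass: every resource looks free to every net
    for _ in range(max_iters):
        for name, (src, dst) in nets.items():
            for n in routes.pop(name, []):  # rip up the previous route
                occupancy[n] -= 1

            def cost(n):
                over = max(0, occupancy[n] + 1 - capacity)
                return (1.0 + history[n]) * (1.0 + pres_fac * over)

            path = shortest_path(graph, src, dst, cost)
            for n in path:
                occupancy[n] += 1
            routes[name] = path
        overused = [n for n in graph if occupancy[n] > capacity]
        if not overused:
            return routes
        for n in overused:  # history cost only ever increases
            history[n] += occupancy[n] - capacity
        pres_fac = 0.5 if pres_fac == 0.0 else pres_fac * 2.0
    return None


# Both nets prefer the shared node M; each also has a longer private detour.
graph = {
    "s1": ["M", "A1"], "s2": ["M", "B1"],
    "M": ["s1", "s2", "t1", "t2"],
    "A1": ["s1", "A2"], "A2": ["A1", "t1"],
    "B1": ["s2", "B2"], "B2": ["B1", "t2"],
    "t1": ["M", "A2"], "t2": ["M", "B2"],
}
nets = {"n1": ("s1", "t1"), "n2": ("s2", "t2")}
routes = negotiate_routes(graph, nets)
print(routes)
```

In the first pass both nets take the short path through M; once M accumulates
cost, at least one net moves to its detour and the routing becomes legal.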

Recently, there have been several novel approaches to FPGA routing based on
the GRASP algorithm by Silva et al. on page 73. Gi-Joon et al. [10] formulate
the FPGA routing problem as a Boolean satisfiability problem and use the GRASP
SAT solver to solve it.
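
The idea of casting routing as satisfiability can be sketched on a toy track
assignment problem. This is not the encoding of [10], and a naive backtracking
search stands in for GRASP; the function names and the problem format (nets,
conflicting net pairs, number of tracks) are assumptions for illustration.

```python
from itertools import combinations


def solve_cnf(clauses, num_vars):
    """Naive backtracking SAT search; literals are signed variable numbers."""
    assignment = {}

    def consistent():
        # A clause is violated once all of its literals are assigned false.
        for clause in clauses:
            if all(abs(lit) in assignment and assignment[abs(lit)] != (lit > 0)
                   for lit in clause):
                return False
        return True

    def search(var):
        if var > num_vars:
            return True
        for value in (True, False):
            assignment[var] = value
            if consistent() and search(var + 1):
                return True
        del assignment[var]
        return False

    return dict(assignment) if search(1) else None


def route_as_sat(nets, conflicts, num_tracks):
    """Assign each net one track so that conflicting nets never share one."""
    index = {n: i for i, n in enumerate(nets)}

    def var(net, track):  # Boolean variable: "net uses track"
        return index[net] * num_tracks + track + 1

    clauses = []
    for n in nets:
        # Each net must use at least one track ...
        clauses.append([var(n, t) for t in range(num_tracks)])
        # ... and at most one track.
        for t1, t2 in combinations(range(num_tracks), 2):
            clauses.append([-var(n, t1), -var(n, t2)])
    for a, b in conflicts:
        # Exclusivity: nets that overlap cannot share a track.
        for t in range(num_tracks):
            clauses.append([-var(a, t), -var(b, t)])
    model = solve_cnf(clauses, len(nets) * num_tracks)
    if model is None:
        return None  # unsatisfiable: provably unroutable with these tracks
    return {n: next(t for t in range(num_tracks) if model[var(n, t)])
            for n in nets}


print(route_as_sat(["a", "b", "c"], [("a", "b"), ("b", "c")], 2))
```

A practical benefit of this formulation is that an unsatisfiable result is a
proof that no routing exists under the given constraints, which a heuristic
router cannot provide.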

2.4 Contributions of FPGAs to CAD 

Most of the work on FPGA CAD has had its roots in ASIC CAD. However, there
have been instances where research spawned by work on FPGAs has made a
significant contribution to ASIC-CAD. One such example is the work by Frankle
on iterative and adaptive timing slack allocation [13], in which he proposed a
generalization of the limit-bumping algorithm and then showed how lower and
upper bounds on connection delays could be used in performance-driven layout
improvement. This work was initially done for FPGA routing but is now a highly
referenced paper for timing slack allocation in timing-driven placement and
routing for both ASICs and FPGAs.

3. New Architecture Definition 

One of the unique ways in which FPGAs depend on CAD is in the area of 
FPGA architecture development. New FPGA architectures must be defined and 
evaluated for cost, routability, and performance. This evaluation is done, typ- 
ically, with the help of CAD tools. In a typical evaluation of a new FPGA 
architecture, the different parameters of the FPGA (logic elements, routing re- 
sources etc.) are set. Using CAD tools modified to support the new architecture, 
designs are implemented on the FPGA and analyzed. Based on this analysis, the 
architecture parameters are modified and optimized to get the best performance 
on a set of designs on which the architecture is evaluated. 

Betz et al. [1] have done pioneering work in this area. With the help of 
CAD tools, the functionality of the logic elements, the size and area of the logic 
elements, and the routing network between the logic elements can be modified
and evaluated for cost, performance, and density. 

4. Future role of CAD 

FPGAs are growing increasingly dense and complex. FPGAs today no longer 
consist of simply gates and routing. Instead, they have several other system 
level features such as processors, gigabit transceivers etc. implemented on them. 
This increasing complexity puts a tremendous onus on CAD tools to efficiently 
implement user designs onto these large and complex FPGAs. 

The performance of the implemented designs and the runtime of the FPGA-CAD
tools are two of the cornerstones for the success of FPGAs. Designs implemented
on FPGAs can have an inherent performance disadvantage relative to those
implemented on ASICs. Consequently, it is critical for the success of FPGAs to
narrow this performance gap. Further, since one of the primary reasons FPGAs
are becoming popular is their faster time-to-market, it is also important for
FPGA implementation tools to reduce their execution times to allow for faster
time-to-solution while operating on increasingly large designs. These
aggressive performance and runtime objectives require significant advances in
CAD algorithms.

A new area in which FPGAs have become very popular is reconfigurable
computing. Due to their inherent reconfigurability, FPGAs are ideal vehicles
for implementing reconfigurable systems. CAD for reconfigurable systems is a
relatively new area that will drive the acceptance of reconfigurable computing
into the mainstream.

In the past 20 years, ICCAD has been a fertile publishing avenue for research 
in CAD. We believe that revolutionary CAD ideas and techniques that will orig- 
inate from ICCAD in the future will provide the breakthroughs in these areas 
and help propel FPGAs and Xilinx to greater success. 
quasi-delay-insensitive, 147
RF, 276, 373 

simulation, 288, 303, 313, 347, 383 
RF, 383, 635 
symbolic, 18 

sizing, 347, 587, 635, 686 
synchronous, 552 
MOS, 295 
synthesis, 313 

topology, 270, 276, 313, 501 
Clearance checking, 493 
Clique partitioning, 114
Clock 

multi-phase clocking, 597 
planning, 472 
schedule, 552 
optimization, 598 
verification, 598 
skew, 473, 509, 636, 641 
minimal, 655 
routing, 515 
useful, 473, 655 
zero, 473, 515, 636, 648 
tree, 472, 510, 636, 648, 655, 679 
CoCo Code compression architecture, 664 
Combinational gate, 296 
Communication architecture tuners, 663



Conflict diagnosis, 10 
conflict-induced clause, 81 
non-chronological backtracking, 78 
Conjugate gradient method, 357 
Constraint 
cluster capacity, 160 
face, 185 

interconnect capacity, 161 
partitioning, 502 
propagation, 78 
Convex function, 348 
Convex program, 273 
transistor sizing as, 299, 635 
Convolution, 352-354, 362 
Critical path, 551 
estimation of, 124 
impact on power, 120 
Crossing time, 350 
Current 
bias, 277 

piecewise approximate, 304 
Custom design, 347 
Cut 

fundamental cutset, 306, 310 
height of, 237 
in a flow network, 242 
K-feasible, 237
transformation of, 240 
volume of, 237 

CYBER synthesis system, 664 
Cycle 

zero weight, 604 
Datapaths, 684 
clustered, 162 
parameterization of, 162 
Data precedence, 108 
Decomposition, 227 
algebraic, 183-184, 260 
AND2/INV, 251
Cholesky, 589 
gate, 237 
Roth-Karp, 237 
using Huffman coding, 238 
Deep Sub-Micron, 143 
See also DSM 
Defects, 557 
Delay 

circuit path, 299 
combinational block, 299 
complex gate, 298 
constant, 654 
Elmore, 297, 369, 510 
interconnect, 393 
optimization, 235 
optimum, 184
testing, 182, 556, 665, 685 
zero/unit model, 340 
Δ-Mapping, 258
Design 

centering, 591 
for testability, 554 
See also DFT 
hierarchical. All 
latency insensitive, 144 
nominal, 587 
platform-based, 99 
programmable, 689 
re-use, 347-348, 362 
RTL, 664 
rule checking, 493 
space exploration, 161 
worst-case, 563 
Device 

characterization, 285 
measurement, 287 
modeling, 285, 303 
sizing, 276 
DFT, 554, 642, 665 
Digital Signal Processor, 677 
See also DSP 
Direct sensitivity, 347 
Divisor 

double-cube, 227 
kernel, 184-185 
multiple-cube, 228-229 
single-cube, 228-229 
two-cube, 183 
Don’t care, 182 
observability, 185 
DRC, 493, 661 

Driving-point admittance, 396, 634 
DSM, 143, 146 

DSP, 95-96, 101, 107, 130, 649, 677 
Dynamic logic, 186 
Dynamic programming, 184 
Dynamic simulation, 348 
Dynamic tuning, 348 
Elmore delay, 297, 369, 510 
Embedded system, 130, 649, 675 
application, 98, 159 
memory, 95 
optimization, 99 
power analysis, 99, 129 
power optimization, 98, 141 
software, 98 

Enclosure checking, 493 

Equivalence checking, 3, 5, 29, 65, 637, 645, 667, 684

ESPRESSO, 182, 205 
ESPRESSO Exact, 208 
Estimation 
capacitance, 123 
critical path, 124 
power, 122 



supply voltage, 124 
Event-driven simulation, 303, 305 
Extraction 

gate-level model, 279 
MIS, 195 
model, 270 
Factoring, 233 
MIS, 197 

False negative, 5, 11, 30 
False path, 199, 348 
analysis, 552 
FastCap, 367 

FastHenry, 367, 403, 407, 635 
Fault 

path delay, 556 
stuck-at, 568 
transition, 556 
untestable, 568 
FLEXWARE, 100 
Floorplanning, 470, 648, 676 
definition, 479 
representations, 480, 483 
Formal verification, 636, 654, 679, 684 
BDD, 18
diagnosis, 8, 17 

equivalence checking, 3, 5, 17, 29, 645, 667, 679, 684

model checking, 4, 640, 646 
rectification, 7, 17 
FPGA, 181, 184, 186 
architecture, 693 
CAD, 689 

lookup-table based (LUT), 235 
placement, 692 
routing, 692 
synthesis, 692 
technology mapping, 235 
Xilinx, 689 

Frequency domain, 452 
Frequency range, 455 
FSM, 185 
equivalence, 8 
power up, 186 
Function 
convex, 296 
flexibility, 186 
multiple-valued, 206 

multiple-valued input binary-valued output, 205
Fundamental cutset, 306, 310 
Fundamental loop, 305-306, 310 
Garbage collection, 343 
Gate array, 181, 469 
Gate 

combinational, 296 
complex, 237, 296
simple, 237 

Geometric operations, 490 
Global flow, 183, 217-218, 647, 683
GMRES, 406 
GOALIE, 473, 489, 648 
Gordian, 469, 636, 647, 666 
complexity, 504 
results, 504 
Gradient 

-based optimization, 348 
conjugate, 357 
projected, 355 
time-domain, 347 
Graph-mapping, 252, 254 
GRASP, 9, 73-74, 634, 646, 667, 693 
Group partial separability, 362 
Hardware-software co-design, 98 
Harmonica, 374, 388 
Harmonic balance, 384 
Hazard 
dynamic, 121 
High-level design, 664 
HYPER, 97 
HYPER-LP, 117 
IC-CAP, 272 
Inductance 
extraction, 403, 635 
partial, 404 
Input patterns, 65, 348 
Instruction level parallelism, 159 
Interconnect 
analysis, 368, 403, 634 
delay, 393, 634 
RLC, 433 
IRSIM, 123, 278 
Iterative solver, 406 
JiffyTune, 271, 347 
Kernel, 228 
Kirchhoff 
current law, 405 
voltage law, 406 
KISS, 185 

Krylov methods, 368, 372, 374, 406 
Lagrange multiplier, 356 
Lagrangian 

augmented, 347, 349, 356 
Λ-Mapping, 257
LANCELOT, 347, 351 
Latch, 296 
level-sensitive, 597 

Levenberg-Marquardt optimization, 351 
Library, 252 
Limit cycle, 306 
Linear system 
solution, 357 
Logical effort, 654 

Logic synthesis, 65, 655, 666, 678, 683 
for testability, 555 
technology mapping, 344 



Loop 

fundamental, 305-306, 310 
LSS, 183 

Macromodels, 433 
Manufacturability, 362, 556 
Mapping graph, 184, 252 
reduced, 253 
Markov chain, 186 
Match, 254 
covered by, 255 
Matching, 254 
Matrix polynomial, 456 
Maximal independent set heuristic, 212 
McBoole, 205 
Media processor, 677 
programmable media chip, 677 
Meet-in-the-middle, 107 
Microcode compiler, 107 
Microprocessor 
MP98, 664 
PDLX, 154 
power models, 130 
Microwave simulation, 383, 635 
MIMOLA, 94 

Minimax optimization, 347, 350, 358 
Minimization 
MINI, 181 
PLA, 181 

Quine-McCluskey, 182 
two-level, 182 
Minmax formulation, 277 
Minos, 351 
MIS, 184, 191, 640 
algebraic decomposition, 197 
extraction, 195 
factoring, 197 
phase assignment, 197 
resubstitution, 196 
simplification, 198 
timing optimization, 199 
Miter, 11, 66

Mixed integer continuous problem, 362 
Mixed-signal systems, 275 
MMIC, 373, 383 
MNA, 435 

Model checking, 4, 8, 637, 640, 646 
bounded, 10 
explicit, 8 
symbolic, 8 
Model 

closed form, 455 
extraction, 270 
fitting, 270-272 
π, 370-371
reduced-order, 456 
Modified Nodal Analysis, 435 
Moments, 441 
Monolithic Microwave Integrated Circuit, 383
See also MMIC 

MOSES security processing architecture, 664 
MOSFET, 352 
model, 297 
MOTIS, 277 

Multi-frontal methods, 357 
Multiple fault 
path delay, 577, 582 
stuck-at, 579-581 
stuck-open, 579 
testable design, 582 
Multiplication 
strength reduction, 122 
Multiplier 
Lagrange, 300 
Multiport, 433 
Netlist 

as incidence structure, 480 
connectivity matrix, 502 
topology matrix, 501 
Network, 185 
Boolean, 185 
depth, 184 
pruning, 343 
Network flow 
augmenting path, 241 
max-flow min-cut theorem, 241 
NMOS, 565 
Noise, 413, 452 
analysis, 413, 453 
covariance matrix, 425 
flicker, 417, 453 
linear time-varying circuits, 413
models, 416 
nonlinear circuits, 413 
numerical, 357 
shot, 417, 452 
simulation, 413 
spectral density, 452 
spectrum, 455 
thermal, 417, 452 
white, 452 
OASYS, 271
Objective function 
quadratic, 502 
OPASYN, 270 
Optimization 
circuit, 347 
delay, 235 
for power, 118 
global, 503 
gradient-based, 348 
group partial separability, 362 
Levenberg-Marquardt, 351 
minimax, 347, 350, 358 
Minos, 351 



numerical methods, 277 
semi-infinite, 362 
sequential, 185 
trust region, 354 
Optimum 
global, 273, 299 
local, 273, 299 
Ordering 
topological, 341 

Orthogonalization of concerns, 145 
Oscillation, 306 
prevention, 306 
Padé approximation, 455
Padé expansion, 372
Parameter 
extraction, 285, 288 
optimization, 288 
Parasitic extraction, 403, 491 
Partial match, 262 
Partitioning 

exhaustive slicing optimization, 503 
hardware-software, 99 
min-cut, 521 
network-flow based, 521 
recursive, 503 
shape functions, 504 
Passivity, 373, 435 
proof, 439 
Path 

blocked, 568 
critical, 300 
false, 568 
Pattern, 254 
distributed, 259 
D-pattern, 259
factored, 259
F-pattern, 259
matching, 279 

Performance specification, 586 
PERT, 551 
Phase assignment 
MIS, 197 
PHIDEO, 97 

Physical design, 276, 467, 666 
Piecewise approximation, 304 
Pi-model, 513 
equivalent, 513 
load, 397, 634 
PLA, 205 
Placement, 655 
incremental, 475 
loose and stable removal, 641 
Positional Cube Notation, 208 
Posynomial, 270, 299, 348 
Power, 649 
analysis, 649, 656 
embedded system, 129 
software, 130
consumption, 118 
CMOS, 118 
reducing, 118 
voltage dependent, 119 
estimation, 99, 122, 186 
software, 139 
measurement, 131 
optimization, 99, 118, 637, 649 
embedded system, 141, 665 
software, 141 
reduction, 118 
Precedence graph, 109 
Preconditioner 
block diagonal, 407 
Schnabel-Eskow, 357 
PRIMA, 373, 438, 635 
Prime implicant, 207 
Process 

disturbances, 563 
Lanczos, 459 
patient, 149 
stochastic, 452 
Program 
convex, 299 

geometric, 274, 277, 299 
posynomial, 270, 299 
Projected gradient, 355 
Property verification, 637, 684 
Protocol 

latency insensitive, 149 
PROUD, 469 
PVL, 437 
algorithm, 455 
reduction, 456 

Pyramid shaped nand gates, 300 
Quadratic optimization 
constrained, 502 
Quadratic placement 
bipartitioning, 499 
Rectangle covering, 185 
Rectangle packing problem, 535 
sequence-pair, 540 
Recursive learning, 183 
Reduced order model, 635 
Redundancy, 181 
addition and removal, 183 
Region-oriented scanline, 491 
Register file 
centralized, 160 
distributed, 160 
Relay station, 144, 149, 151 
Reset sequence, 186 
Resistance 
skin effect, 409 
source-to-drain, 297 
Resistive shielding, 401, 634 



Resubstitution 
MIS, 196 
Resynthesis, 185 

Retiming, 553, 597, 615, 678, 684 
area-delay curve, 627 
Bellman-Ford algorithm, 620 
constrained min-area, 615 
period edge pruning, 624 
min-area, 615 

minimum cost circulation, 627 
min-period, 615 
predecessor heuristic, 621 
relaxation, 619 
W and D matrices, 623 
RF 

circuit, 276, 373 
design, 93, 276, 367, 649 
simulation, 373, 383, 635 
RICE, 442 

RLC interconnect, 433 
Routing, 472, 655 
incremental, 475 
touch and cross, 641 
SAT, 5, 9, 73, 182, 637, 640, 667
Satisfiability, 73, 556 
See also SAT 
Scheduling, 107, 162 
Schnabel-Eskow preconditioner, 357 
Segment tree structure, 494 
Semi-infinite optimization, 362 
Sensitivity, 347 
adjoint, 347, 352 
calculation, 300 
direct, 347, 352 
Sequence-pair, 540 
Shannon’s expansion, 577, 582 
Shell, 144, 152 
Shooting method, 375, 384 
Signal processing 
power reduction in, 125 
Silicon compiler, 95, 107, 677 
Simplification 
MIS, 198 

Simulated annealing, 125, 277, 299, 479, 545, 636, 648, 676

acceptance probability, 480, 485 
algorithm, 480 
entropy, 482 
finalization, 486 
initialization, 485 
moves, 480 
schedule, 485 
score function, 481 
selection probability, 480, 484 
state space, 480 
Simulation, 313, 685 
circuit, 303, 347, 383 
dynamic, 348
event-driven, 303, 305 
fast, 278 
fault, 69 

frequency-domain, 383 
hardware accelerator, 337 
hardware-software co-simulation, 99
microwave, 383 
RF, 383 

switch-level, 337 
symbolic, 4, 18 
system-level, 455 
ternary, 186
SIS, 184 
Sizing, 185 
Skin effect, 409 
Slew-rate, 277 
SMO formulation, 597 
SoC, 93, 100, 275 
test, 555 
Software 

power analysis, 130 
power and energy, 131 
power estimation, 139 
power optimization, 141 
Solution space 
P-admissible, 536 
rectangle packing problem, 536 
Solving linear systems, 357 
Specification 
performance, 586 
SPECS2, 270 
SPECS, 303, 347, 352 
Spectre, 374, 388 
SpectreRF, 374, 635 
SPFD, 186 

Sphere of influence, 306 
SPICE, 123, 276, 288, 307, 313, 348, 374-375, 413, 443, 451, 556, 565, 660
State assignment, 185 
Static timing analysis, 299, 348 
Steepest descent, 277 
Stochastic differential equations, 419 
Subnetwork 
series-connected, 296 
Substitution 

multiplication to addition, 122 
Symbolic analysis 
switch-level, 338 
Symbolic model checking, 8 
Synthesis 
analog, 635 

asynchronous circuit, 182 
FPGA, 692 
interface, 99 
logic, 637 

multi-valued circuit, 182 



op amp, 313 

technology independent, 181 
System-level Design, 663 
System on a Chip, 100 
See also SoC 
Table models, 303 
Tagged signal model, 149 
TECAP2, 270 

Technology mapping, 181, 184, 255 
Tellegen’s theorem, 352 
Temporal logic, 4 
Test, 554, 665 
built-in self, 554 
compaction, 665 
defect based, 556 
IDDQ, 556 
invalidation, 576 
resource partitioning, 554 
scan chain, 555 
SoC, 555 

stuck-at fault, 65, 568 
synthesis, 554 
untestable fault, 65, 568 
Testability 
partial scan, 665 
robust, 575-576, 579 
Testbench, 685 
Throughput 

digital computation, 119 
TILOS, 270, 295 
heuristic, 299 
Timing, 273, 551 
analysis, 551 
algorithm, 569 
block oriented, 552 
functional, 551 

static, 296, 348, 567, 647, 655, 662, 684 
closure, 146 
driven layout, 641 
optimization in MIS, 199 
simulation, 277, 303 
Tolerance body, 586 
Tool 

AS/X, 352 
BooleDozer, 181 
CATHEDRAL II, 96, 107 
CHESS, 100 
CINDERELLA, 100 
Design Compiler, 181, 683 
ESPRESSO, 182, 205 
ESPRESSO Exact, 208 
FASTHENRY, 403 
FLEXWARE, 100 
GOALIE, 473, 489, 648 
GORDIAN, 469, 666 
Harmonica, 388 
HYPER-LP, 117 
IC-CAP, 272
IRSIM, 278 
JiffyTune, 271, 347 
KISS, 185 

LANCELOT, 347, 351 
LSS, 181 
McBoole, 205 
MIMOLA, 94 
MIS, 183, 191 
PRIMA, 373, 438, 635 
PROUD, 469 
RICE, 442 
SIS, 183 
SPECS2, 270 
SPECS, 303, 347, 351 
SPICE, 313, 383, 565 
System Architect Workbench, 94 
TECAP2, 270 
TILOS, 295 
TRANALYSE, 337 
TRANGEN, 665 
UTMOST, 272 
Trade-off 
area-delay, 185 
TRANALYZE, 271 
TRANGEN, 665 
Transfer function, 454 
Transformation 
algebraic, 655 
architectural, 126 
VLSI circuits, 120 
Transistor 

bidirectional model, 337 
channel-connected component, 341 
extraction, 491 

sizing, 270, 273, 347, 635, 686 
equation-based, 273 



simulation-based, 273 
Transition 
spurious, 121
Transmission lines, 383 
Tree/link formulation, 303 
Tree mapping, 184 
Trust region, 347, 354 
Truth table, 182 
Tuning 
dynamic, 348 
static, 348 
Ugate, 253 
factor-free, 262 
User interface, 347, 351, 359 
UTMOST, 272 
Variability, 557 
within-die, 557 
Variable accuracy, 303 
Variable lifetime, 114 
Vertex coloring, 113 
Very Long Instruction Word, 159
See also VLIW 
VLIW, 96, 100-101, 159
Voltage 

estimation of, 124 
piecewise approximate, 304 
supply, 126 
Wave-pipelining, 597 
Wire length minimization 
quadratic, 636 
Worst-case 
analysis, 557 
distance, 588 
parameter set, 588 
Yield, 557, 563 
optimization, 591 
parametric, 586 